Abstract

This report is concerned with the selection of an estimator which is
unbiased when applied to structural parameter estimation, i.e., the
estimation of a set of unknown parameters contained in a vector $h$
relating certain states, $Y_e$ and $X_e$, which are measured with
uncertainty. The form of this relationship is known and follows from
the structure (nature) of some process (i.e., $Y_e = X_e h$).
Structural parameter estimation is differentiated from conventional
parameter estimation, in which $Y_e$ is measured with uncertainty but
$X_e$ is known exactly. The parameter $h$ may vary with time according
to the difference equation $h = \Phi h_o + \Gamma w$, where $\Phi$ and
$\Gamma$ are known and $w$ is a random noise term. If $X_e$ is known
exactly, a weighted least squares objective function ($J$) is defined
wherein the error vector depends on estimates of $h$, resulting in a
conventional weighted least squares (CWLS) estimate of $h$.

It is shown that the CWLS estimate is biased when applied to
structural parameter estimation. Two distinct approaches to bias
removal are suggested: (1) change the CWLS estimator or (2) change the
objective function.

Two methods are discussed with reference to the first approach. In the
subtraction method, the noise statistics are used to eliminate the
bias approximately. In addition, methods are suggested for estimating
the noise statistics if they are unknown. Unfortunately, the new
estimator eliminates the bias at the cost of increasing the variance
of the estimate. In the instrumental variable (IV) method, an
additional measurement is taken which, if available, removes the bias.

With reference to the second approach, an augmented objective function
is minimized by linearizing the partial derivatives about previous
parameter estimates. The result is a linearized iterative weighted
least squares (LITWELS) technique, which is the major contribution of
this report. The LITWELS estimator is shown to be unbiased in an
asymptotic manner when the noise statistics are known. Methods of
estimating unknown noise statistics are suggested. The LITWELS
estimator minimizes the residuals associated with the estimates of
$X_e$ and $Y_e$. A simple example problem is presented and solved
using the above methods. Applications are suggested with reference to
adaptive control, prosthetic devices, and image enhancement.
1.0 Introduction

Figure 1 contains a block diagram of the basic system considered in
this report. The composite system is interconnected and designed to
control the system output given the reference input. The control law
may result from an optimization criterion, root locus considerations,
etc. Contained within the system is a relationship between certain
system states $Y_e$ and $x_e$. The form of this relationship is known
and follows from the structure (nature) of the process. The
relationship may be written in two ways.

    $$ Y_e = H x_e $$    (1)

    $$ Y_e = X_e h $$    (2)

In (1), $Y_e$ is a column vector containing the process output states
(in contrast to the system output), whereas $x_e$ is a vector
containing the process input states (in contrast to the system input).
The actual parameters contained in the matrix $H$ are unknown. In (2),
all the states contained in the vector $x_e$ are re-formed into a
matrix $X_e$. All the unknown parameters contained in $H$ are
re-formed into a vector $h$.

As an example, suppose (1) is

Accordingly, (3) can be written in the form of (2)



The structural parameter estimation problem as discussed in this
report is a logical extension of conventional weighted least squares
theory and is characterized by the following, which refers to (2).

1. $Y_e$ and $X_e$ are measured with uncertainty.

2. $h$ is estimated in a weighted least squares sense.

3. $h$ may be time varying of the form

       $$ h = \Phi h_o + \Gamma w $$    (5)

   where $\Phi$ and $\Gamma$ are known and $w$ is a random noise term.
   In this case the estimate of $h$ is $\hat{h} = \Phi \hat{h}_o$.

4. The estimate of $h$ must be unbiased.

In conventional weighted least squares (the abbreviation CWLS will be
used) the parameter estimation problem is simpler than (5) in that
$X_e$ is known exactly.
Four basic cases are considered with reference to (5).

    Problem        Measurement    Parameter
    Designation    of X_e         Matrix h
    -----------    -----------    ------------
    I              Noisy          Time Varying
    II             Noisy          Constant
    III            Exact          Time Varying
    IV             Exact          Constant


Problems III and IV are within the realm of CWLS theory, whereas
Problems I and II are within the realm of structural parameter
estimation theory. Problems II and IV are simpler forms of Problems I
and III respectively and follow by letting $\Phi = I$ and
$\Gamma = 0$.

In this report, Problems I and III are treated. The results may easily
be applied to Problems II and IV.

1. The conventional weighted least squares (CWLS) parameter estimator
becomes a minimum covariance estimator if the weighting matrices are
properly chosen (Problem III).

2. The CWLS parameter estimator is biased when applied to structural
parameter estimation (Problem I).

3. If the CWLS parameter estimator is changed by the subtraction
method, the bias may be removed approximately by estimating the bias
using noise statistics and noisy sensor measurements (Problem I).

4. If the CWLS parameter estimator is changed by introducing a
properly chosen instrumental variable (IV), the bias may be removed
(Problem I).

5. If the objective function is augmented and if the resulting partial
derivatives are linearized about previous parameter estimates, a
linearized iterative weighted least squares (LITWELS) parameter
estimator results. If the noise statistics are used properly, the
estimator is unbiased in an asymptotic manner, that is, if the $n$-th
estimate is correct, then the $(n+1)$-st estimate is unbiased
(Problem I).

6. An example problem is worked.

7. Unknown noise statistics may be estimated from the residuals of the
objective function.

8. Applications are suggested.
2.0 Conventional Weighted Least Squares (Problem III)

The sensor equations for measurements of $Y_e$ and $X_e$ are

    $$ Y_s = Y_e + v $$
    $$ X_s = X_e $$                  (6)

The noise statistics are

    $$ E(w) = E(v) = 0 $$
    $$ \mathrm{Cov}\, w = Q $$       (7)
    $$ \mathrm{Cov}\, v = R $$

Subject to the equations of 5, 6, and 7 we wish to estimate $h$. To
estimate $h$, a weighted least squares objective function is defined
as a function of an error $\hat{e}$ depending on $\hat{h}_o$. One
estimates $h$ by taking $\hat{h} = \Phi \hat{h}_o$.

    $$ J = \hat{e}^T M \hat{e}
         = (Y_s - X_s \Phi \hat{h}_o)^T M (Y_s - X_s \Phi \hat{h}_o) $$    (8)

In order to select $\hat{h}_o$ to minimize $J$, let us differentiate
$J$ with respect to $\hat{h}_o$. If we equate the resulting expression
to zero, the following value of $\hat{h}_o$ minimizes $J$ for positive
definite $M$.

    $$ \hat{h}_o = [\Phi^T X_s^T M X_s \Phi]^{-1} \Phi^T X_s^T M Y_s $$    (9)

With regard to the matrix $M$ which weights the error terms of the
vector $\hat{e}$, Gauss made the following statement in 1809
concerning orbital parameters.

    "If the astronomical observations and other
    quantities on which the computation of orbits is based
    were absolutely correct, the elements also, whether
    deduced from three or four observations would be
    strictly accurate (so far indeed as the motion is
    supposed to take place exactly according to the laws
    of Kepler) and, therefore, if other observations were
    used, they might be confirmed but not corrected. But,
    since all our measurements and observations are nothing
    more than approximations to the truth ... the most
    probable value of the unknown quantities will be that
    in which the sum of the squares of the differences
    between the actual observed and computed values
    multiplied by numbers that measure the degree of
    precision is a minimum."


The "degree of precision" has since been defined to be the inverse of
the covariance matrix of $e$, where

    $$ e = Y_s - X_s \Phi h_o = X_e \Gamma w + v $$
    $$ \mathrm{cov}\, e = X_e \Gamma Q \Gamma^T X_e^T + R $$                 (10)
    $$ M_{\text{optimal}} = [X_e \Gamma Q \Gamma^T X_e^T + R]^{-1} $$

Hence, for low noise values, terms of $M$ are large (high degree of
precision) and for high noise values, terms of $M$ are low (low degree
of precision).
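
As a rough illustration of (10), the sketch below evaluates the optimal
weight for a single scalar measurement. It assumes scalar $X_e$,
$\Gamma$, $Q$, and $R$; the function name `optimal_weight` and the
numerical values are hypothetical and are not taken from the report.

```python
def optimal_weight(x_e, gamma, q, r):
    """Scalar form of (10): M = 1 / (x_e * gamma * q * gamma * x_e + r).

    All arguments are scalars here (an assumption made for illustration).
    """
    cov_e = x_e * gamma * q * gamma * x_e + r  # covariance of the error e
    return 1.0 / cov_e


# Hypothetical values: same signal level, two different noise levels.
low_noise_weight = optimal_weight(x_e=2.0, gamma=1.0, q=0.01, r=0.01)
high_noise_weight = optimal_weight(x_e=2.0, gamma=1.0, q=1.0, r=1.0)

print(low_noise_weight, high_noise_weight)
```

With these values the low-noise measurement receives the larger weight,
which is the behavior described in the preceding paragraph.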

As a result of the choice of $M_{\text{optimal}}$, the covariance of
the estimation error becomes simply

    $$ E\{(\hat{h}_o - h_o)(\hat{h}_o - h_o)^T\} = [\Phi^T X_s^T M X_s \Phi]^{-1} $$    (11)

It may also be shown that (9) is an unbiased estimate of $h_o$, that is

    $$ E\{\hat{h}_o\} = h_o $$    (12)

Equation 11 represents the minimum covariance matrix for the
estimation error related to $\hat{h}_o$ under the following
conditions [3]:

    Class of Estimators        Measurement Noise
    -----------------------    --------------------------
    weighted least squares     zero mean, finite variance
    linear                     white
    linear and nonlinear       white Gaussian

Let us now consider the effect of uncertainty in measuring $X_e$ upon
the expected value of $\hat{h}_o$.


3.0 Bias of the CWLS Estimator for Problem I

In addition to Equations 5, 6, and 7, let us assume that sensor
measurements of $X_e$ contain a noise $N$

    $$ X_s = X_e + N $$
    $$ E(N) = 0 $$            (13)
    $$ E(N N^T) = S $$

The sensor equation for $Y_s$ can be written

    $$ Y_s = (X_s - N) \Phi h_o + X_e \Gamma w + v $$    (14)

If (14) is substituted into (9) and if the expected value is taken,
the following is the result, where the noise sequences $w$, $v$, and
$N$ are assumed to be uncorrelated

    $$ E\{\hat{h}_o\} = [I - T] h_o $$
    $$ T = E\{[\Phi^T X_s^T M X_s \Phi]^{-1} \Phi^T X_s^T M N \Phi\} $$    (15)

We conclude that when $N = 0$ (CWLS parameter estimation) the result
is unbiased, but when $N \neq 0$ (structural parameter estimation) the
result is biased and the bias must somehow be removed.
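
The bias in (15) can be observed numerically. The following is a
minimal sketch, assuming a scalar constant parameter ($\Phi = 1$,
$\Gamma = 0$), equal weights, and Gaussian noise; the names
`cwls_scalar` and `simulate`, the seed, and all numerical values are
hypothetical. The large-sample prediction used for comparison replaces
the expectation of a ratio by the ratio of expectations, which is an
approximation.

```python
import random


def cwls_scalar(x_meas, y_meas):
    """Equally weighted scalar form of (9): h = sum(x*y) / sum(x*x)."""
    sxy = sum(x * y for x, y in zip(x_meas, y_meas))
    sxx = sum(x * x for x in x_meas)
    return sxy / sxx


def simulate(h_true=10.0, sigma_x=1.0, sigma_y=1.0, k=20000, seed=1):
    """Return (CWLS estimate, approximate large-sample prediction)."""
    rng = random.Random(seed)
    # True regressor values cycle through 1, 2, ..., 10.
    x_true = [1.0 + (i % 10) for i in range(k)]
    # Both the regressor and the output are measured with noise.
    x_meas = [x + rng.gauss(0.0, sigma_x) for x in x_true]
    y_meas = [h_true * x + rng.gauss(0.0, sigma_y) for x in x_true]
    h_hat = cwls_scalar(x_meas, y_meas)
    # Ratio-of-expectations approximation to E{h_hat}.
    sxx_true = sum(x * x for x in x_true)
    predicted = h_true * sxx_true / (sxx_true + k * sigma_x ** 2)
    return h_hat, predicted


h_hat, predicted = simulate()
print("CWLS estimate:", h_hat, "approximate prediction:", predicted)
```

Under these assumptions the estimate settles below the true value of
10, since the noise in the regressor inflates the denominator.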


4.0 The Subtraction Method (Problem I)

One obvious method suggests itself for removing the bias present in
(15): premultiply the estimator by $[I - T]^{-1}$. The expected value
of $\hat{h}_o$ would then become

    $$ E\{\hat{h}_o\} = [I - T]^{-1} [I - T] h_o = h_o $$    (16)

which is an unbiased estimate of $h_o$.

Note that for Problem I, $T$ (the bias) depends on $\Phi$, $M$, $X_e$
and the noise statistics. We must therefore know the deterministic
signal $X_e$; however, the entire bias problem is caused by noisy
measurements of $X_e$, referred to as $X_s$. Hence, any practical bias
removal method involving the matrix $T$ requires the estimation of $T$
(i.e., $\hat{T}$) as a function of the noise statistics and $X_s$. The
net result is that the bias is approximately but not completely
removed using $\hat{T}$. Suppose that $\hat{T}$ is chosen from (15).

    $$ \hat{T} = [\Phi^T X_s^T M X_s \Phi]^{-1} \Phi^T \tau \Phi $$
    $$ \tau = E\{X_s^T M N\} = E\{N^T M N\} $$                        (17)

The following estimator results from premultiplying (9) by
$[I - \hat{T}]^{-1}$, where $\hat{T}$ is defined in (17)

    $$ \hat{h}_o = [\Phi^T X_s^T M X_s \Phi - \Phi^T \tau \Phi]^{-1} \Phi^T X_s^T M Y_s $$    (18)

Note that the term $\Phi^T \tau \Phi$ attempts to "subtract off" the
effect of the bias, hence the term "subtraction method."
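
For a scalar constant parameter with equal weights, (18) reduces to a
ratio of sums in which $k\sigma^2$ is subtracted from the denominator.
The sketch below illustrates that special case only; it assumes
$\Phi = 1$, a known regressor-noise variance `sigma_x2`, and a
hypothetical function name `subtraction_estimate`.

```python
def subtraction_estimate(x_meas, y_meas, sigma_x2):
    """Scalar, equally weighted form of (18).

    sigma_x2 is the (assumed known) variance of the noise on each
    regressor measurement; k * sigma_x2 plays the role of tau.
    """
    k = len(x_meas)
    sxy = sum(x * y for x, y in zip(x_meas, y_meas))
    sxx = sum(x * x for x in x_meas)
    return sxy / (sxx - k * sigma_x2)


# Hypothetical data: with sigma_x2 = 0 the result is the CWLS estimate.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
print(subtraction_estimate(x, y, 0.0))   # CWLS special case
print(subtraction_estimate(x, y, 1.0))   # denominator reduced by k * sigma_x2
```

Note that when $k\sigma^2$ is comparable to the sum of squares, the
denominator becomes small and the estimate is correspondingly
sensitive, which is one way to see the increased variance of this
estimator.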

Although the bias is approximately removed by the subtraction method,
there are four objections to consider. (1) The covariance of the
estimation error is increased as compared to the biased CWLS
estimator. (2) The above matrices contain $k$ sets of data for
multistage processes. Subtracting the probabilistic means and
variances included in $\hat{T}$ when $k$ is small may give grossly
inaccurate results, since the sample means and variances may be quite
different from their probabilistic counterparts. As more samples are
included in the above matrices, this method becomes more useful since
the sample means and variances approach the true means and variances.
(3) The statistics may be unknown, hence procedures must be sought for
estimating the necessary statistics. (4) Selection of the weighting
matrices is not at all as well defined as in Problem III
(Equation 10).

5.0 The Instrumental Variable Method (Problem I)

Let us now consider the use of an additional variable whose purpose is
to achieve an unbiased estimator for Problem I. This method is called
the instrumental variable (IV) method since the additional variable is
an instrument for the desired result [18,19,20].

For Problem I one simply rewrites the CWLS algorithm (9) to include
the instrumental variable ($Z$). The instrumental variable replaces
$X_s$ in such a way that the resulting estimate of $h_o$ is unbiased,

    $$ \hat{h}_o = [\Phi^T Z^T M X_s \Phi]^{-1} \Phi^T Z^T M Y_s $$    (19)

The instrumental variable must be uncorrelated with the noise terms
$N$, $v$, and $w$. If $Z$ is thus chosen, then substituting (14) into
(19) and taking the expected value results in the proof that
$E\{\hat{h}_o\} = h_o$, that is, we have an unbiased estimate of
$h_o$.

One advantage of the IV method is that no statistics need be known to
remove the bias. It is necessary, however, to find ways of choosing an
instrumental variable or forcing some variable to approach $X_e$. The
paper by Young [20] describes an interesting hybrid (analog and
digital) scheme. The end result is that $Z$ approaches $X_e$ by proper
adjustment of the model parameters. If $Z$ is highly correlated with
$X_e$, then the weighting matrix $M$ is chosen as suggested by (10),
i.e., $M = [Z \Gamma Q \Gamma^T Z^T + R]^{-1}$.

In summary, an instrumental variable $Z$ is used as per (19). If $Z$
is uncorrelated with the noise terms, the resultant estimator is
unbiased at the expense of the covariance matrix for the estimation
error. If $Z$ is highly correlated with $X_e$, the covariance matrix
for the estimation error may be reduced by selecting $M$ as above.
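
The structure of (19) can be sketched for a scalar constant parameter
with equal weights ($\Phi = 1$, $M$ cancels). The function name
`iv_estimate` and the data below are hypothetical; the sketch shows
only the arithmetic of the estimator, not how a suitable instrument is
obtained.

```python
def iv_estimate(z, x_meas, y_meas):
    """Scalar, equally weighted form of (19): h = sum(z*y) / sum(z*x)."""
    szy = sum(zi * yi for zi, yi in zip(z, y_meas))
    szx = sum(zi * xi for zi, xi in zip(z, x_meas))
    return szy / szx


# Hypothetical measurements of the regressor and the output.
x_meas = [1.1, 1.9, 3.2]
y_meas = [2.0, 4.0, 6.0]

# An instrument that differs from the noisy regressor measurements.
z = [1.0, 2.0, 3.0]
print(iv_estimate(z, x_meas, y_meas))

# Using the measurements themselves as the instrument gives back CWLS.
print(iv_estimate(x_meas, x_meas, y_meas))
```

Choosing $Z$ equal to $X_s$ recovers the CWLS form (9), so the bias
removal rests entirely on $Z$ being uncorrelated with the noise terms.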

6.0 The Linearized Iterative Weighted Least Squares Technique (Problem I)

Let us consider a new approach for achieving unbiased structural
parameter estimates. This approach follows directly from a weighted
least squares minimization problem. It may be shown that unbiased
results follow from a specific selection of the weighting matrices,
similar to the weighting matrix selection of (10). Recall that the
basic linear relationship between the states and parameters can be
written in two ways

    $$ Y_e = X_e h $$      (20)
    $$ \;\;\;\; = H x_e $$      (21)

In view of (21) let us write the sensor equation for $x_s$

    $$ x_s = x_e + n $$
    $$ E(n) = 0 $$               (22)
    $$ \mathrm{cov}\, n = S $$

In (5) the time variation of the parameter vector $h$ was defined. In
a similar manner, the time variation of the parameter matrix $H$ may
be defined

    $$ H = H_o \Theta + W D $$    (23)

where $W$ is a random noise term. Subject to Equations 5, 6, 7, 13,
22, and 23, we wish to estimate $h$. To estimate $h$, a weighted least
squares objective function $J$ is defined by two equivalent
expressions. The first term of each expression for $J$ depends on
estimates of the noise sequence $n$, while the second term depends on
estimates of both $n$ and $h_o$. One estimates $h$ by taking
$\hat{h} = \Phi \hat{h}_o$.

    $$ J = \hat{n}^T M_1 \hat{n} + \hat{e}^T M \hat{e} $$
    $$ J = \hat{n}^T M_1 \hat{n} + [Y_s - (X_s - \hat{N}) \Phi \hat{h}_o]^T M [Y_s - (X_s - \hat{N}) \Phi \hat{h}_o] $$    (24)
    $$ \;\; = \hat{n}^T M_1 \hat{n} + [Y_s - \hat{H}_o \Theta (x_s - \hat{n})]^T M [Y_s - \hat{H}_o \Theta (x_s - \hat{n})] $$

The first expression above may be differentiated with respect to
$\hat{h}_o$ while the second (equivalent) expression may be
differentiated with respect to $\hat{n}$. At this point, the partial
derivatives $\partial J / \partial \hat{h}_o$ and
$\partial J / \partial \hat{n}$ form a set of nonlinear coupled
equations, since $\hat{N}$ depends on $\hat{n}$ and $\hat{H}_o$
depends on $\hat{h}_o$. It is possible to linearize the partial
derivatives and decouple the variable $\hat{n}$ from $\hat{h}_o$ by
assuming that $\hat{N}$ and $\hat{H}_o$ are constants determined by
the last estimates of the variables $\hat{n}$ and $\hat{h}_o$.

The derivation of the solution to (24) begins by selecting the $n$-th
estimate of $n$ to obtain $\hat{N}_n$. It is useful to define

    $$ \bar{X}_s = X_s - \hat{N}_n $$    (25)

The next step is to differentiate (24) with respect to $\hat{h}_o$ to
obtain an equation not involving $\hat{n}$. The solution of this
equation for the $(n+1)$-st estimate of $h$ follows

    $$ \hat{h}_{o(n+1)} = [\Phi^T \bar{X}_s^T M \bar{X}_s \Phi]^{-1} \Phi^T \bar{X}_s^T M Y_s $$
    $$ \hat{h}_{n+1} = \Phi \hat{h}_{o(n+1)} $$                                                    (26)

Having a value of $\hat{h}_{o(n+1)}$ we immediately use that to obtain
$\hat{H}_o$. By differentiating $J$ with respect to $\hat{n}$ we
obtain an equation independent of $\hat{h}_o$ which may be solved for
the $(n+1)$-st estimate of $n$

    $$ \hat{n}_{n+1} = [M_1 + \Theta^T \hat{H}_o^T M \hat{H}_o \Theta]^{-1}
                       [\Theta^T \hat{H}_o^T M \hat{H}_o \Theta x_s - \Theta^T \hat{H}_o^T M Y_s] $$    (27)

The algorithm of (25)-(27) is therefore iterative and begins with an
initial estimate of $N$. If the initial estimate is zero, the first
estimate of $h$ is identical to the CWLS estimate.

In summary we have developed a linearized iterative weighted least
squares algorithm, which may be abbreviated LITWELS. This method is
both iterative (requiring a fixed amount of data storage) and
recursive. In contrast, the CWLS algorithm is only recursive (that is,
given a fixed amount of data, only one optimal filtered result occurs).
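
The iteration of (25)-(27) can be sketched for the simplest case, a
scalar constant parameter relating one regressor to one output. The
sketch assumes $\Phi = 1$, $\Theta = 1$, scalar weights `m1` and `m`
shared by all data sets, and a zero initial noise estimate; the
function name `litwels_scalar` and the data are hypothetical.

```python
def litwels_scalar(x_meas, y_meas, m1, m, iterations=10):
    """Scalar LITWELS sketch for y = x * h with noise on both x and y.

    m1 weights the regressor-noise estimates and m weights the output
    residuals. Returns the list of successive estimates of h.
    """
    n_hat = [0.0] * len(x_meas)  # initial noise estimate is zero
    history = []
    for _ in range(iterations):
        # (25): correct the regressor measurements by the noise estimates.
        x_bar = [x - n for x, n in zip(x_meas, n_hat)]
        # (26): least squares using the corrected regressors.
        h_hat = (sum(xb * y for xb, y in zip(x_bar, y_meas))
                 / sum(xb * xb for xb in x_bar))
        history.append(h_hat)
        # (27): re-estimate the regressor noise using the new h_hat.
        denom = m1 + m * h_hat * h_hat
        n_hat = [(m * h_hat * h_hat * x - m * h_hat * y) / denom
                 for x, y in zip(x_meas, y_meas)]
    return history


# Hypothetical noisy data scattered about y = 2 x.
estimates = litwels_scalar([1.1, 1.9, 3.2], [2.1, 3.9, 6.2], m1=1.0, m=1.0)
print(estimates[0], estimates[-1])
```

Because the noise estimate starts at zero, the first entry of the
returned list is the CWLS estimate, in agreement with the remark above.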

The following procedure may be used to demonstrate that the algorithm
of (25)-(27) converges in expected value to $h_o$ in an asymptotic
manner: assume that $\hat{h}_{o(n+1)} = h_o$ and show that the
expected value of $\hat{h}_{o(n+2)}$ is $h_o$. To do so, however,
requires substituting the noise estimate $\hat{n}_{n+1}$ into the
parameter estimate $\hat{h}_{o(n+2)}$, which requires some way of
relating $\hat{n}$ to $\hat{N}$ and $\hat{H}_o$ to $\hat{h}_o$. These
relationships are difficult to establish in general since elements of
the column vectors must be arranged into matrices. It must be stressed
that the LITWELS technique may converge in general but that proving so
is quite difficult. Convergence can be established for problems with a
scalar output (i.e., each of the sets of data contained in $Y_s$
satisfies a single equation linearly relating one output state
$Y_{ei}$ to $n$ other states $X_{ei}$ and $n$ parameters $h_i$).
Hence, for the $i$-th set of data



(28) 
scalar). Since $X_e$ and $x_e$ may be related, the sensor noise terms
$N$ and $n$ may be related. Similarly the parameters $H_o$ and $h_o$
may be related. It is now possible to demonstrate the convergence of
the LITWELS algorithm as follows.

1) Let $\hat{h}_{o(n+1)} = h_o$ so that $\hat{H}_o = H_o$. Calculate
   $\hat{n}_{n+1}$.

2) Substitute $\hat{N}_{n+1}$ into (25) to obtain $\hat{h}_{o(n+2)}$.

3) Take the expected value of the result.


The resulting expected value is of the form $E\{A^{-1} B\}$ where $A$
and $B$ are matrices. Expected values of this type are quite difficult
to evaluate, even when the probability densities of the noise terms
are known. Let us assume, however, that $A$ and $B$ contain relatively
large amounts of signal and relatively small amounts of noise. Hence,
$A^{-1}$ and $B$ are correlated only through the noise terms. If the
higher moments of the noise terms are negligible (as when the noise
has a Gaussian amplitude), then one can approximate the desired
expected value by $E\{A^{-1}\} E\{B\}$ since $A^{-1}$ and $B$ are
nearly uncorrelated. If this is done, then $E\{\hat{h}_o\} = h_o$ when
(as per (10))

    $$ M = [\bar{X}_s \Gamma Q \Gamma^T \bar{X}_s^T + R]^{-1} $$
    $$ M_1 = S^{-1} $$                                              (28)

Note that complete knowledge of $X_e$ implies that $S$ approaches zero
(little sensor noise is present), hence $\hat{n}$ approaches zero for
all iterations, hence (25) becomes the optimal conventional weighted
least squares estimate, as we would expect.

The above asymptotic convergence property of the LITWELS estimator is
a rather weak property. It may be possible to establish the following.

1. If the expected value of $\hat{h}_{o(n+1)}$ equals the expected
value of $\hat{h}_{o(n)}$, the expected value must be $h_o$.

2. The expected value of the estimator converges uniformly to $h_o$.
This phenomenon was observed in an example problem, in which the mean
of the LITWELS algorithm converged exponentially to the correct
parameter value.

The storage requirements of the LITWELS algorithm are of interest.
When $Y_s$ contains scalar measurements and $h$ is a constant
parameter vector of dimension $n+1$, the CWLS estimator of (9)
requires the storage of a fixed number of running sums determined by
$n$. It is possible to combine (25)-(27) into one equation for the
constant parameter case. As a result, only one additional running sum
must be stored for the LITWELS estimator. The data storage requirement
is therefore minimal for the LITWELS estimator.

7.0 An Example Problem

The analysis presented in the preceding sections appears to treat a
single stage process with one set of measurements. However, the
matrices $Y_e$, $X_e$, $h$, etc. may be defined to contain $k$
submatrices, each one defined for one stage; that is, the equation
$Y_e = X_e h$ may be defined to contain $k$ equations of the form
$Y_{ei} = X_{ei} h_i$. Hence, the analysis is also valid for a
multistage process. For example, (9) may be written in terms of a $k$
stage process where $Y_s$, $X_s$, etc., each contain $k$ components.

    (29)

In the following multistage example problem, the parameter is
constant, hence $\Phi_i = I$ and $\Gamma_i = 0$ for all stages of the
process. The parameter estimation equations may be simplified
accordingly and written using summations similar to (29).

Suppose we are in an automobile traveling at a constant, but unknown,
velocity. The distance $y_i$ at time $t_i$ is related to the time
$t_i$ by the constant velocity $v$, that is

    $$ y_i = t_i v $$    (30)

In order to estimate the velocity, distance measurements are taken by
reading the mileposts at five second intervals. Suppose, however, that
a random error exists in the spacing of the mileposts. Suppose, in
addition to the random distance measurement error, the clock has a
random error associated with its timing, hence the distance
measurements may be taken at actual time intervals of 4.95 sec,
5.10 sec, 5.15 sec, 4.9 sec, etc., even though the clock shows elapsed
time intervals of 5.0 seconds. This type of parameter estimation
problem is termed structural parameter estimation since the actual
structure (parameter matrix) of the process relates quantities which
are known with uncertainty.

Let us define $y_{mi}$ as the measurement of $y_i$ and let us define
$t_{mi}$ as the measurement of $t_i$. Now suppose we take $k$ sets of
measurements. A weighted least squares fit to a plot of $y_{mi}$
versus $t_{mi}$ may be used to estimate the constant velocity. The
weighted least squares objective function $J$ corresponding to this
problem follows from (8)

    $$ J = \sum_{i=1}^{k} M_i (y_{mi} - t_{mi} \hat{v})^2 $$    (31)

Let us assume that each of the error terms is weighted equally (all
the $M_i$ are equal). The CWLS value of $\hat{v}$ which minimizes $J$
follows from (9) or (29) where the $M_i$ cancel.


    $$ \hat{v} = \frac{\sum t_{mi} y_{mi}}{\sum t_{mi}^2} $$    (32)

From (15) we know that the expected value of $\hat{v}$ is $(1 - T)v$,
where $T$ may be defined from the random clock error $n_{ti}$

    $$ T = E\left\{ \frac{\sum t_{mi} n_{ti}}{\sum t_{mi}^2} \right\} $$    (33)

In order to remove the bias due to $T$, let us estimate $T$ as per
(17), where the variance of each noise is $\sigma_t^2$.

    $$ \hat{T} = \frac{k \sigma_t^2}{\sum t_{mi}^2} $$    (34)

Using this estimate of $T$, the subtraction method estimator (18)
follows by premultiplying (32) by $1/(1 - \hat{T})$.

    $$ \hat{v} = \frac{\sum t_{mi} y_{mi}}{\sum t_{mi}^2 - k \sigma_t^2} $$    (35)

The term $k \sigma_t^2$ "subtracts off" the bias due to the noisy
clock readings.

Let us now consider the instrumental variable method where an
additional measurement is used. For this case, the measurement must be
uncorrelated with the random distance and clock errors. After each
milepost reading suppose we write down the last digit of the license
number of the next passing car. As an instrumental variable $Z_i$ for
the $i$-th reading, let us use $i$ times the selected digit. Since the
digit is uniformly distributed in the interval 1-10, the mean value of
$Z_i$ is $5i$, which is also the mean value of the $i$-th time
measurement (since the readings are taken at 5 second intervals).
Although $Z_i$ is uncorrelated with the random time and milepost
placement errors, it is also uncorrelated with $t_i$, which degrades
the covariance of the estimation error but removes the bias. The IV
estimator, as applied to this example, follows from (19),

    $$ \hat{v} = \frac{\sum Z_i y_{mi}}{\sum Z_i t_{mi}} $$    (36)



Let us consider the LITWELS approach to this problem. The objective
function (24) is as follows, where all the $M_{1i}$ are equal to $m_1$
and all the $M_i$ are equal to $m$

    $$ J = \sum \hat{n}_{ti}^2 m_1 + \sum [y_{mi} - (t_{mi} - \hat{n}_{ti}) \hat{v}]^2 m $$    (37)

If we have an $n$-th estimate of the clock measurement errors, then
the $(n+1)$-st estimate of $v$ follows from (25) and (26)

    $$ \bar{t}_{mi} = t_{mi} - \hat{n}_{ti} $$
    $$ \hat{v}_{n+1} = \frac{\sum \bar{t}_{mi} y_{mi}}{\sum \bar{t}_{mi}^2} $$    (38)

With the above velocity estimate we can select an $(n+1)$-st estimate
of each measurement error from (27)

    $$ \hat{n}_{ti} = \frac{\hat{v}_{n+1}^2 m t_{mi} - \hat{v}_{n+1} m y_{mi}}{m_1 + m \hat{v}_{n+1}^2} $$    (39)

If $\sigma_t^2$ is the variance of each time measurement error, then
$m_1$ should be $1/\sigma_t^2$. If $\sigma_y^2$ is the variance of
each distance measurement error, then $m$ should be $1/\sigma_y^2$.
Equations 38 and 39 can be combined, resulting in the LITWELS
estimator for the unknown velocity problem.

    $$ \hat{v}_{n+2} = \frac{(m_1 + m \hat{v}_{n+1}^2)
                             (m_1 \sum t_{mi} y_{mi} + m \hat{v}_{n+1} \sum y_{mi}^2)}
                            {m_1^2 \sum t_{mi}^2 + 2 m_1 m \hat{v}_{n+1} \sum t_{mi} y_{mi}
                             + m^2 \hat{v}_{n+1}^2 \sum y_{mi}^2} $$    (40)



In comparison with the CWLS estimator of (32), only one additional
running sum must be stored, that is, the summation of the terms
$y_{mi}^2$.

It is important to note that if the time measurement errors are small,
$\sigma_t^2$ approaches zero, hence the subtraction estimator (35) and
the LITWELS estimator (40) both approach the CWLS estimator (32). For
the same case, as the instrumental variable $Z_i$ approaches the time
values $t_i$, the IV estimator (36) approaches the CWLS estimator. In
summary, all three estimators reduce to the CWLS estimator when time
is correctly measured.
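
The storage remark can be checked with a short sketch. Substituting
(39) into (38) gives corrected times of the form
$(m_1 t_{mi} + m \hat{v} y_{mi}) / (m_1 + m \hat{v}^2)$, so one update
of the velocity needs only the three running sums of $t_{mi}^2$,
$t_{mi} y_{mi}$, and $y_{mi}^2$. The code below compares that
running-sum form with the two-step form; the function names, the
weights, and the data are hypothetical, and the combined expression is
the one obtained by this substitution rather than a transcription of
the report's own typeset equation.

```python
def update_from_sums(v, s_tt, s_ty, s_yy, m1, m):
    """One LITWELS velocity update using only three running sums."""
    d = m1 + m * v * v
    num = d * (m1 * s_ty + m * v * s_yy)
    den = m1 * m1 * s_tt + 2.0 * m1 * m * v * s_ty + m * m * v * v * s_yy
    return num / den


def update_two_step(v, t_meas, y_meas, m1, m):
    """The same update written as (39) followed by (38)."""
    d = m1 + m * v * v
    n_hat = [(m * v * v * t - m * v * y) / d for t, y in zip(t_meas, y_meas)]
    t_bar = [t - n for t, n in zip(t_meas, n_hat)]
    return (sum(tb * y for tb, y in zip(t_bar, y_meas))
            / sum(tb * tb for tb in t_bar))


# Hypothetical clock and milepost readings.
t_meas = [5.1, 9.8, 15.2, 19.9]
y_meas = [50.5, 99.0, 151.0, 201.5]
s_tt = sum(t * t for t in t_meas)
s_ty = sum(t * y for t, y in zip(t_meas, y_meas))
s_yy = sum(y * y for y in y_meas)

v0 = s_ty / s_tt  # CWLS estimate (32) as the starting value
m1, m = 100.0, 1.0  # assumed weights 1/sigma_t^2 and 1/sigma_y^2
print(update_from_sums(v0, s_tt, s_ty, s_yy, m1, m))
print(update_two_step(v0, t_meas, y_meas, m1, m))
```

The two printed values agree up to rounding, and letting `m1` grow
large (small clock error) drives both toward the CWLS value `v0`.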

Figure 2 contains the results of a simulation concerning a constant
parameter problem similar to the above but of a higher dimension
(2 parameters were unknown). Plotted are the sample means for
20 repetitions of each estimator.

The preceding theory is verified by Figure 2. The CWLS estimator is
obviously biased from the correct value. The subtraction method
results in a less biased estimator. The IV estimator is unbiased; a
perfect instrumental variable was chosen. For the unknown velocity
example, a perfect IV may be defined as a variable $Z_i$ equal to the
correct time value $t_i$. A practical case where $Z_i$ must be
generated would result in poorer performance. The LITWELS estimator's
sample mean converges toward the correct parameter value and, in fact,
does so exponentially.

The above remarks support the conclusion that the four types of
estimators can be ranked as follows, from most to least successful in
providing unbiased estimates with a low estimation variance:

1. LITWELS

2. IV

3. Subtraction

4. CWLS

8.0 Unknown Covariance Matrices

We have considered three methods for achieving unbiased estimates with
regard to the structural parameter estimation problem. The
instrumental variable method requires no knowledge of the noise
statistics. However, both the subtraction method and the linearized
iterative weighted least squares (LITWELS) method require knowledge of
certain covariance matrices. This section considers the problem of
estimating the covariance matrices when the statistics are unknown.
Residuals (errors and squared errors) are used.

For Problem I it is necessary to estimate the covariance matrices $Q$,
$R$, and $S$ which are defined for the noise vectors $w$, $v$, and $n$
respectively. Recall that each noise vector may correspond to a $k$
stage process so that $w$ may contain $k$ terms of dimension
$k_1 \times 1$, $v$ may contain $k$ terms of dimension $k_2 \times 1$,
and $n$ may contain $k$ terms of dimension $k_3 \times 1$. Suppose
that the covariance matrices $Q$, $R$, and $S$ are all nonsymmetric,
that is, the noise sequences are correlated and nonstationary. There
must be enough equations to estimate all the elements of $Q$, $R$, and
$S$, hence $(k k_1)^2 + (k k_2)^2 + (k k_3)^2$ equations would be
required. The resulting large number of equations would be most
difficult to solve. Suppose that the covariance matrices $Q$, $R$, and
$S$ are diagonal, that is, the noise sequences are uncorrelated but
not necessarily stationary. For this case, one would be required to
generate $k k_1 + k k_2 + k k_3$ equations to estimate the diagonal
elements of $Q$, $R$, and $S$. There would still be a rather large
number of equations involved since $k$ represents the number of
stages. Suppose that the noise sequences are uncorrelated and
stationary. In this case, only $k_1 + k_2 + k_3$ equations would be
required since $k_1$, $k_2$, and $k_3$ elements along the main
diagonals of $Q$, $R$, and $S$ would be unique. For the purposes of
this discussion, one additional simplification is used to reduce the
complexity of the analysis such that only inner products (scalars)
need be evaluated. It is assumed that the ratios of the elements of
$Q$, $R$, and $S$ are known, hence one need only determine scalar
multiplying factors $q$, $r$, and $s$ in order to completely specify
all the matrix elements of $Q$, $R$, and $S$. As a result, only three
scalar equations must be generated if we assume that

1. The noise vectors are uncorrelated and stationary.

2. The covariance ratios are known.
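
The equation counts quoted above can be tabulated with a few lines of
code. This is only a bookkeeping sketch; the function name
`equations_required`, the dictionary keys, and the example dimensions
are hypothetical.

```python
def equations_required(k, k1, k2, k3):
    """Number of equations needed to estimate Q, R, and S in each case.

    k is the number of stages; k1, k2, k3 are the per-stage dimensions
    of the noise vectors w, v, and n.
    """
    return {
        # Every element of each covariance matrix is unknown.
        "correlated_nonstationary": (k * k1) ** 2 + (k * k2) ** 2 + (k * k3) ** 2,
        # Only the diagonal elements are unknown.
        "uncorrelated_nonstationary": k * k1 + k * k2 + k * k3,
        # Diagonal elements repeat from stage to stage.
        "uncorrelated_stationary": k1 + k2 + k3,
        # Ratios known: one scalar factor each for Q, R, and S.
        "known_ratios": 3,
    }


# Hypothetical sizes: 10 stages, noise dimensions 2, 1, and 2.
counts = equations_required(k=10, k1=2, k2=1, k3=2)
for case, count in counts.items():
    print(case, count)
```

For the hypothetical sizes shown, the count falls from 900 equations in
the fully general case to 3 under the two assumptions listed above.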

In order to generate two of the three scalar equations, one may
utilize a modified CWLS estimator in which $Q$ and $R$ are set equal
to identity matrices in (10) and the resulting value of $M$ is used in
(9). For the first equation, the sum of the residuals $\hat{e}$ from
(8) may be equated to the expected value of the sum (which depends on
$s$, $h_o$, $\Phi$, $\Gamma$, and $X_e$). Since $X_s$, not $X_e$, is
available, the following approximation to this relation is made

    $$ \sum \hat{e} \approx f_1(s, h_o, \Phi, \Gamma, X_s) $$    (41)

For a third equation, $J$ may be evaluated for a CWLS estimate where
$M$ in (8) and (9) is an identity matrix. The result (let us use $J_1$
to specify the new objective function) is set equal to an approximate
expression for its expected value

    $$ J_1 \approx f_3(s, r, q, h_o, \Phi, \Gamma, X_s) $$    (42)

In summary, we now have three equations in four unknowns ($s$, $r$,
$q$, and $h_o$).

Using the above three equations, let us consider how the sub- 
traction method and the L I TWELS method may be adjusted for the case 
where the noise covariances are unknown. 

For the subtraction method, recall that we need an estimate 
of the bias matrix T which estimate depends on s, 4, T, and X s< The 
modified CWLS estimate may be set equal to an approximate expression 
for Its expected value 

h Q = Cl - T (s, 4, r, X s )3 h Q (43) 

Equation (41) may be solved for s as a function of the unknown h and 

o 

the result may be substituted into (43). A Newton Raphson approach is 

then necessary to obtain h Q from (43) since (43) Is a nonlinear function 

of h , 1 

o 

For the LITWELS method It is necessary to use all three covariance 
matrices hence all three scalars must be obtained. The following scheme 
Is suggested 

1. Set h Q equal to the latest LITWELS estimate. Start with 
the modified CWLS estimate. 

2. Solve (41) for s using h Q from I. 

3. Solve (42) and (43) for r and q. Use s from 2 and h Q from 
I. Use a Newton Raphson approach 

4. Obtain a LITWELS estimate using the covariance matrices 
resulting from 2 and 3. 


5. Repeat the above steps.

The general method suggested in this section may be applied
to the example problem reported in Figure 2, wherein the noise statistics
are assumed to be known for a two-dimensional unknown constant parameter
problem and estimates of one of the parameters are plotted. The unknown
statistics case is shown in Figure 3. The CWLS and IV estimators, which
do not use the noise statistics for bias removal, are repeated here for
reference. Note that the subtraction method estimator is less biased
in Figure 3 than in Figure 2. This occurs because the subtraction method
is highly dependent on the accuracy of the noise statistics and, as
shown in Figure 4, the sample variance of one noise term is somewhat
different from the constant ($\sigma^2$) used in adjusting the CWLS estimator.
For the unknown statistics case, the subtraction method procedure is
able to identify the sample variance of the noise, as shown in Figure 4;
hence the resulting estimates are less biased, as shown in Figure 3.

The LITWELS method is degraded somewhat in its asymptotic approach to
the correct parameter value. For the constant parameter problem, it
is necessary for the LITWELS method to estimate r and s, whereas the
subtraction method estimates only s. The resulting performance is
apparently degraded by this requirement. Performance, however, is
quite satisfactory.
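
To make the comparison among the CWLS, subtraction, and IV estimators more concrete, the following is a minimal scalar sketch in Python. It is not the matrix estimator of this report: it assumes a single constant parameter h, independent zero-mean Gaussian noise of known variance on both sensed states, and an instrument formed from a second, independently corrupted measurement of the true X. The function name `simulate_estimators`, the argument names, and the default values are illustrative assumptions.

```python
import random


def simulate_estimators(n=20000, h=2.0, noise_var=0.5, seed=0):
    """Scalar errors-in-variables experiment (illustrative assumptions only).

    True states: y = x * h. Sensed states: xs = x + noise, ys = y + noise.
    Returns the (cwls, subtraction, iv) estimates of h.
    """
    rng = random.Random(seed)
    sd = noise_var ** 0.5
    sxx = sxy = szx = szy = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)          # true state X (unit variance)
        y = h * x                        # true state Y = X h
        xs = x + rng.gauss(0.0, sd)      # sensed X
        ys = y + rng.gauss(0.0, sd)      # sensed Y
        z = x + rng.gauss(0.0, sd)       # instrument: correlated with X,
                                         # independent of the sensor noise
        sxx += xs * xs
        sxy += xs * ys
        szx += z * xs
        szy += z * ys
    cwls = sxy / sxx                         # ordinary least squares ratio
    subtraction = sxy / (sxx - n * noise_var)  # remove the noise power in xs
    iv = szy / szx                           # instrumental variable ratio
    return cwls, subtraction, iv
```

Under these assumptions the CWLS ratio tends to h Var(X) / (Var(X) + noise_var), so it is attenuated toward zero, while the other two ratios tend to h. The subtraction form relies on `noise_var` being accurate, which mirrors the sensitivity discussed above.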

9.0 Applications 

Structural parameter estimation has been discussed analytically,
an example problem has been worked, and simulation results have been
presented. Parameter estimation problems which can be cast into the
form (5) can be solved using the methods of this report. In general,
if we are concerned with a process $Y_o = X_o h$ where both $Y_o$ and $X_o$ are
measured with uncertainty and where the time variation of h may be
modeled as in (5), structural parameter estimation techniques are
applicable. In this section, problems related to adaptive control,
prosthetic devices, and image enhancement are discussed in relation
to structural parameter estimation.

9.1 Adaptive Control 

An adaptive control system has two inter-related goals: (1)
estimate the system parameters, and (2) derive a control law, using
the parameters, which causes the closed loop system to perform in some
desirable manner [26]. The system "adapts" to changes in its structure
by adjusting the control law in an appropriate manner. Figure 1 may be
related to an adaptive control system for the structural parameter
estimation problem by adding a disturbance at the input and by using
the parameter estimates to generate the control law. If it is possible
to describe the corresponding parameter variations by (5) and if noisy
measurements are taken of the states ($Y_o$ and $X_o$) related by h, then
the adaptive control parameter estimation problem becomes one of
structural parameter estimation. One may estimate $h_o$ using the methods of
this dissertation and then utilize $\hat{h} = \phi \hat{h}_o$ to generate an
appropriate control law. Let us consider applications which call for the
use of adaptive control.

(1) The handling characteristics of a vertical or short take
off and landing (VSTOL) aircraft on take off are similar to those of a
helicopter, but, in level flight, the handling characteristics are
similar to those of a conventional airplane. It is desirable to generate
a control system such that handling characteristics are constant.

(2) A shuttlecraft system has been proposed as an economical
way of transporting men and material to and from space stations [28, 29].
Each boost vehicle is to "fly" back to earth for future use. The
shuttlecraft, which rides "piggy-back" with the booster, is to be highly
maneuverable such that its occupants may land at certain pre-selected
locations. Each shuttlecraft (somewhat like a flying bathtub) must
maneuver with precision since the earth's atmosphere is reentered at
a very low angle and since the landing speed is in excess of 100 miles
per hour. The aerodynamic conditions under which the shuttlecraft
must operate are indeed varied, since the craft must fly from a near
vacuum down to sea level, causing the handling characteristics to vary
accordingly.

(3) There are cases where unmanned spacecraft might utilize
adaptive control. For example, spacecraft attempting unmanned landings
on Mars and Venus encounter large aerodynamic variations. American
Venus probes have indicated that surface atmospheric pressure is 75 to
100 times that of Earth^^. 

9.2 Prosthetic Devices 

Recent medical research has utilized the millivoltages generated
by muscular contractions (myoelectric voltages) to actuate artificial
limbs. The patient learns to use certain muscles to manipulate the
limb.

A law governing artificial limb movement can be defined as a
solution to a structural parameter estimation problem. Suppose the
vector $Y_s$ is defined to be the sensed myoelectric voltages. Let the
constant parameter vector be defined to be the three desired torques ($\tau$)
and the three desired translational forces (f) at the time when the
voltage measurements are taken:

$$h_o = \begin{bmatrix} \tau \\ f \end{bmatrix} \qquad (44)$$

Suppose we define a connection matrix $X_s$ which relates the myoelectric
voltages (muscle contractions) to the six degrees of freedom ($h_o$). A
random error vector e can be defined as

$$e = Y_s - X_s h_o \qquad (45)$$

The voltage measurements ($Y_s$) contain a random error. The connection
matrix ($X_s$) is an approximation to the true relationship between the
muscle contractions and the desired movement of the limb. Since we
have satisfied the conditions for structural parameter estimation, $\hat{h}_o$
in (9) may be a biased estimate of the desired limb movement, making
it necessary to use the methods of this dissertation to remove the bias.

9.3 Image Enhancement 

With regard to received radio signals, filtering methods are
used for removal of interference due to transmission of the signal.
The received signal, a time function, contains the desired transmitted
signal plus a noise term (the interference) which is strong in certain
finite frequency bands.

An analogous problem exists when a picture is transmitted through
space. The Mariner 4 took pictures of the planet Mars and reduced each
image to a grid of numbers which were encoded and transmitted to Earth,
where the signals were decoded. The resulting data can be treated as
a function of distance (by recording the numbers at intervals on a
sheet of paper). The string is a sensed function $y_s(x)$

$$y_s(x) = y(x) + s(x) \qquad (46)$$

where $y_s(x)$ contains the desired signal y(x) plus a coherent noise term
s(x) which is strong at certain frequencies. Current techniques remove
the coherent noise by using convolution filtering in the spatial Fourier
transform domain. Conventional least squares parameter estimation can
be used as an alternate scheme.

Let us assume that the frequencies of the coherent noise terms
have been identified from the power spectral density of $y_s(x)$. For each
frequency, the variation of s(x) with respect to x can be written

$$\begin{bmatrix} \dot{s}(x) \\ s(x) \end{bmatrix} = \Phi(x) \begin{bmatrix} \dot{s}(0) \\ s(0) \end{bmatrix} \qquad (47)$$

The values of $\dot{s}(0)$ and $s(0)$ determine the amplitude and phase of the
signal s(x). If we consider the vector on the right side of (47) to be
a constant parameter vector $h_o$, then we can estimate $h_o$ to minimize a
least squares error where the i-th error is

$$e_i = y_s(x_i) - [0 \;\; 1]\, \Phi(x_i)\, h_o$$

The result of determining $\hat{h}_o$ is a function $\hat{s}(x)$ which, when
subtracted from $y_s(x)$, results in an estimate of y(x) which is optimal in
a least squares sense.
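
As an illustration of this least squares scheme, here is a minimal Python sketch. It assumes, beyond what the text states, that the coherent noise at one identified frequency $\omega$ is a pure sinusoid, so that $s(x) = s(0)\cos\omega x + (\dot{s}(0)/\omega)\sin\omega x$. The function names and the direct solution of the 2-by-2 normal equations are illustrative choices, not the report's procedure.

```python
import math


def fit_coherent_noise(xs, ys, omega):
    """Least squares estimate of (s(0), sdot(0)) for one known frequency.

    Model (an assumption of this sketch):
        s(x) = s0 * cos(omega * x) + (sdot0 / omega) * sin(omega * x)
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for x, y in zip(xs, ys):
        c = math.cos(omega * x)            # regressor multiplying s0
        s = math.sin(omega * x) / omega    # regressor multiplying sdot0
        a11 += c * c
        a12 += c * s
        a22 += s * s
        b1 += c * y
        b2 += s * y
    det = a11 * a22 - a12 * a12            # determinant of the normal matrix
    s0 = (a22 * b1 - a12 * b2) / det
    sdot0 = (a11 * b2 - a12 * b1) / det
    return s0, sdot0


def remove_coherent_noise(xs, ys, omega):
    """Subtract the fitted sinusoid from the sensed samples."""
    s0, sdot0 = fit_coherent_noise(xs, ys, omega)
    return [
        y - (s0 * math.cos(omega * x) + (sdot0 / omega) * math.sin(omega * x))
        for x, y in zip(xs, ys)
    ]
```

In this sketch the desired signal y(x) plays the role of the residual, so any component of y(x) at the same frequency is removed along with the noise.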

10.0 Conclusions 

It has been demonstrated that the conventional weighted least
squares estimate is biased when applied to the structural parameter
estimation problem. The three methods of bias removal, in order of
effectiveness, are

1. LITWELS

2. IV

3. Subtraction

The LITWELS method and the subtraction method require the use of noise
covariance matrices. When these matrices must be estimated, the LITWELS
technique is somewhat degraded, whereas the subtraction method may actually
be improved at the expense of greater complexity. The IV method does
not use the covariance matrices for bias removal, but the IV must be
generated to be uncorrelated with the noise terms and correlated with
$X_o$. If minimum variance is desired for the IV method, the noise covariance
matrices must be known or derived.

ON SEQUENTIAL SEARCH FOR THE
MAXIMUM OF AN UNKNOWN FUNCTION

S. Yakowitz


1. INTRODUCTION 


Many problems arising in engineering and operations research contexts
have the following structure: The decision maker is provided with a class
F of functions, whose common domain, X, is specified. Some mechanism selects
a function f from F. The decision maker is not informed of this choice. He
would like somehow to find a point $x^* \in X$ at which f assumes its maximum
value (denoted by $\|f\|$). Toward this end, the decision maker may
sequentially and without constraint select elements $x_1, x_2, \ldots$ from X.
Upon choosing $x_n$, he is informed of the value $f(x_n)$. Thus he may come to
learn certain features of f. Any (perhaps randomized) strategy for choosing
$x_n$ on the basis of the sequence of pairs $\{(x_j, f(x_j))\}_{j<n}$ will be
termed a search procedure.

The problem of finding a search procedure S under which, for all $f \in F$,
$\{f(x_n)\}$ converges to $\|f\|$, in some specified sense, has generated a
lively body of research papers, some of which will be referenced and
described in the present paper.


As an example of the sort of engineering question giving rise to a search
problem, suppose that an airplane is to fly with a fixed velocity. Its fuel
efficiency will then be a function of the carburetion setting. If x is the
relative mixture of fuel and air and f(x) the associated fuel consumption
required to maintain the aircraft's velocity, then the framework for a search
problem is present. For this problem, X may be taken to be the unit interval
and F, perhaps, may be considered to be the set of continuous functions on
the unit interval.

Under certain restrictions on F and X, effective search procedures have
been revealed. The most publicized of these is the "gradient method" which,
in its simplest form, determines $x_{j+1}$ from $x_j$ by estimating the gradient
$\nabla f$ of f at $x_j$ (by difference approximations derived from local
samples) and then setting $x_{j+1} = x_j + \lambda \nabla f(x_j)$. $\lambda$ is a
scalar chosen from heuristic considerations and may vary as the process
evolves. If the functions in F are concave or at least unimodal and X is
bounded and sufficiently regular, the gradient method can perhaps provide a
Cauchy sequence $\{f(x_j)\}$ converging to $\|f\|$. Hadley's book Nonlinear and
Dynamic Programming [1] devotes a nicely written chapter to the gradient
method and its variations. The review paper by Spang [2] has an extensive
bibliography on the gradient method, more recent techniques of which are
described in the book by Osborne and Kowalik [3].
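
The simplest form of the gradient method described above can be sketched as follows. This is a minimal Python illustration, assuming a fixed step $\lambda$, central differences for the gradient estimate, and an unconstrained domain; the names `finite_difference_gradient`, `gradient_search`, `lam`, and `step` are illustrative and do not come from the cited references.

```python
def finite_difference_gradient(f, x, step=1e-5):
    """Estimate the gradient of f at x by central differences."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += step
        down[i] -= step
        grad.append((f(up) - f(down)) / (2.0 * step))
    return grad


def gradient_search(f, x0, lam=0.1, iterations=200):
    """Iterate x_{j+1} = x_j + lam * grad f(x_j) from the start point x0."""
    x = list(x0)
    for _ in range(iterations):
        g = finite_difference_gradient(f, x)
        x = [xi + lam * gi for xi, gi in zip(x, g)]
    return x
```

For a concave quadratic such as f(x) = -(x[0] - 1)**2 - 2*(x[1] + 0.5)**2, this iteration converges to the maximizer when `lam` is small enough; for multi-modal f it may stop at a local maximum, which is one motivation for the random searches studied in this paper.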

J. Kiefer [4,5] has published interesting analyses for the case that X
is a bounded interval in the real line. In particular, under the search
procedure he proposes, in n trials (the number n must be specified in advance)
the point $x^*$ at which $f(x^*) = \|f\|$ can be located within a distance of
$1/L_n$, $L_n$ being the nth Fibonacci number, when F is the set of unimodal
functions on [0,1]. Further, the search procedure is minimax in the sense that
no non-randomized strategies can improve on this operating point error
uniformly in F. Bellman and Dreyfus [6] devote a chapter to this optimization
approach. To this writer's knowledge, an analogous search which also possesses
the minimax property has yet to be revealed for multi-dimensional X.
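
A Fibonacci search of this kind can be sketched in Python as follows. This is a common textbook formulation, written here for maximization, and is not claimed to reproduce Kiefer's exact procedure. It assumes f is strictly unimodal on [a, b], uses the indexing fib[0] = fib[1] = 1, and returns a bracket rather than a single point. The function name is illustrative.

```python
def fibonacci_search(f, a, b, n):
    """Bracket the maximizer of a unimodal f on [a, b] with n evaluations.

    The returned bracket has length 2 * (b - a) / fib[n], where
    fib = 1, 1, 2, 3, 5, ... (fib[0] = fib[1] = 1).
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    fib = [1, 1]
    while len(fib) <= n:
        fib.append(fib[-1] + fib[-2])
    # Two interior points placed by Fibonacci ratios.
    x1 = a + (b - a) * fib[n - 2] / fib[n]
    x2 = a + (b - a) * fib[n - 1] / fib[n]
    f1, f2 = f(x1), f(x2)
    for k in range(n, 2, -1):
        if f1 < f2:
            # The maximizer cannot lie in [a, x1): drop that part.
            a = x1
            x1, f1 = x2, f2
            x2 = a + (b - a) * fib[k - 2] / fib[k - 1]
            f2 = f(x2)
        else:
            # The maximizer cannot lie in (x2, b]: drop that part.
            b = x2
            x2, f2 = x1, f1
            x1 = a + (b - a) * fib[k - 3] / fib[k - 1]
            f1 = f(x1)
    # In the last pass (k = 3) the new point coincides, up to rounding,
    # with the retained one, so that evaluation adds no information.
    return a, b
```

The midpoint of the returned bracket is then within (b - a) / fib[n] of the maximizer. Indexing conventions for the Fibonacci numbers differ between authors, so the index in this bound may be shifted relative to the statement above.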

An intriguing search model (which is slightly closer to the path to be
followed here in that probabilistic ideas are prominent and multi-modal
functions are included in F) was proposed by H. Kushner [7,8], who supposed f
to be a sample function from a Brownian motion process on a bounded linear
interval, X. An advantage to this viewpoint is that, in addition to
including multi-modal functions, ideas from Wiener prediction theory can be
brought to bear on the problem of designing an optimal search procedure.
Kushner points out that numerical evaluation of the optimal procedure is
computationally prohibitive, but suggests (without proof) a search procedure
under which $\lim_{n \to \infty} (1/n) \sum_{i=1}^{n} f(x_i) = \|f\|$, almost surely.

The research reported in this paper follows an approach sketched by S.
Brooks [9]. Presumably, Brooks took X to be a bounded subset of a Euclidean
space, and the loss associated with the function $f \in F$ and operating point
$x \in X$ to be

L(x,f) = relative (with respect to X) volume of points x' such that
f(x') > f(x).

Then, given any positive numbers c and d, a smallest number N is readily
calculated such that if $X_1, X_2, \ldots, X_N$ are selected uniformly from X,
then for any real-valued function f,

$$P[\min_{1 \le j \le n} L(X_j, f) > c] < d, \quad \text{for } n \ge N.$$

Brooks, as well as Kushner, considers the possibility that the
measurements may be corrupted by additive noise. These considerations will
be detailed, along with a brief review of "stochastic approximation", in a
later section (Section 4) of this paper.

Let us loosely summarize the results of our investigation. (X, A) will
be a measurable space, and M the set of measurable functions on X. P is a
probability function on (X, A). Examples show that no search procedure
achieves $f(X_j) \to \|f\|$, even over the continuous functions on the unit
interval or functions f on a countable X. However, a search is presented such
that for all $f \in M$, $f(X_n) \to \|f\|$ in P-probability. Also, we reveal a
search which achieves P-almost-sure convergence to $\|f\|$ of the terms
$(1/n) \sum_{i=1}^{n} f(X_i)$. The section closes with a description of various
important subsets of M for which there are searches achieving Cauchy
convergence of $\{f(X_j)\}$ to $\|f\|$.

Generalizing the idea of Brooks mentioned above, Section 3 proposes,
as the loss associated with operating point $x \in X$ and criterion $f \in M$,
the function $L(x,f) = P[\{y : f(y) > f(x)\}]$. The motivation is that we are
able to derive upper bounds on the number of search iterations needed to
achieve some given level of performance. Specifically, given positive numbers
c and d, we compute searches $S_1$ and $S_2$ and numbers $N_1$ and $N_2$ such
that under $S_1$, if $n > N_1$,

$$P[L(X_n, f) > c] < d;$$

under $S_2$,

$$P\Big[\sup_{n > N_2} \frac{1}{n} \sum_{i=1}^{n} L(X_i, f) > c\Big] < d.$$

Section 4 generalizes the search problem previously discussed by allowing
that the observations of f(x) may be corrupted by measurement noise. In the
first theorem of the section, it is shown that if the measurement noise is
additive and identically and independently (of x and f(x) as well as previous
samples) distributed, then under our search procedure (which is independent of
the noise distribution)

$$f(X_j) \to \|f\|, \text{ in P-probability,}$$

and

$$L(X_j, f) \to 0$$

in P-probability. Also, for positive e, c, and d, a procedure is revealed
for finding the sample size N such that, under the search described, if n > N,

$$P[\,P\{y : f(y) > f(X_n) + e\} > c\,] < d.$$

This last result does require that the noise distribution be known.
In Theorem 4.6, the noise is allowed to depend on x and f(x), but various
assumptions are made about its mean, median, and variance.


2. ON THE EXISTENCE OF CONVERGENT SEARCHES. 

We first introduce the notation and terminology to be used in the
sequel. Let (X, A) be a measurable space and M the set of real-valued
measurable functions on X. We shall always assume that each singleton set
is in A. Let G be a subset of M and let $\|f\| = \sup_{x \in X} f(x)$ for
$f \in M$ ($\|f\| = +\infty$ is possible). (We note that $\|f\|$ is not a true
norm as it may be negative, for example.) A deterministic search procedure is
a collection of measurable mappings $\{m_k;\ k = 0, 1, 2, \ldots\}$ of
$X^k \times R^k$ into X (where R is the real line). Given a deterministic
search procedure, for $f \in G$ define inductively $x(0,f) = m_0$ and
$x(k+1,f) = m_{k+1}(x(0,f), \ldots, x(k,f), f(x(0,f)), \ldots, f(x(k,f)))$. We say
that G has a deterministic search if a deterministic search procedure exists
such that $\lim_{n \to \infty} f(x(n,f)) = \|f\|$ for all $f \in G$. Of course, the
intuition behind this definition is that x(n,f) is the next point at which we
observe the value of f after having observed f(x(j,f)), j = 1, 2, ..., n-1.

A random search procedure consists of a mapping
$m_k(B; x_1, \ldots, x_k, y_1, \ldots, y_k)$ defined for $B \in A$ and $x_i \in X$
and $y_i \in R$, k = 0, 1, 2, .... Further, for fixed
$x_1, \ldots, x_k, y_1, \ldots, y_k$, $m_k(\cdot\,; x_1, \ldots, y_k)$ is a
probability measure on (X, A) and for fixed $B \in A$, $m_k(B; \cdot)$ is a
measurable function on $X^k \times R^k$. We interpret
$m_k(\cdot\,; x_1, \ldots, x_k, y_1, \ldots, y_k)$ as the conditional probability
distribution of $X_{k+1}$ if we observe $x_1, \ldots, x_k$,
$f(x_1) = y_1, \ldots, f(x_k) = y_k$. For each $f \in G$ we may find a
probability distribution on the sequence space $X^\infty$ by defining a
consistent family of measures on $X^k$ for each k. Let $P_f^1 = m_0$ and
inductively,

$$P_f^k(B \times A) = \int_B m_{k-1}(A; x_1, \ldots, x_{k-1}, f(x_1), \ldots, f(x_{k-1}))\, P_f^{k-1}(dx_1, dx_2, \ldots, dx_{k-1})$$

where $A \in A$ and $B \in A^{k-1}$. Let $P_f$ be the resulting probability
measure on $X^\infty$.

We say that G has an almost sure search if there is a random search
such that for all $f \in G$,

$$P_f(f(X_n) \to \|f\|) = 1,$$

where $X_n$ is the identity function on the nth coordinate of $X^\infty$. We say
that G has a search in probability if for all $f \in G$,
$\lim_{n \to \infty} P_f(f(X_n) \in N(\|f\|)) = 1$ for each neighborhood
$N(\|f\|)$ of $\|f\|$ (with the usual neighborhood system at infinity). If
$\|f\|$ is finite this is the same as requiring that
$\lim_{n \to \infty} P_f(|f(X_n) - \|f\|| < e) = 1$ for each e > 0.

In most spaces if we consider G = M, then it is too much to hope for
any sort of convergence since a function may be large on a "small" set of
points. To get around this trouble it is convenient to allow the function
to be arbitrarily defined on a small set. Let P be a probability measure
on (X, A) and for each $f \in M$ let $\|f\|_P$ be the P-essential least upper
bound of f. The measure P may take into account a priori knowledge of which x
values are important, but we will not elaborate this point.

We say that G has a P-almost sure search if there is a random search
such that for all $f \in G$,

$$P_f(\liminf f(X_n) \ge \|f\|_P) = 1.$$

We say that G has a P-search in probability if there is a random search such
that $\lim_{n \to \infty} P_f(f(X_n) \in (\|f\|_P - e, +\infty)) = 1$
for each e > 0. (The obvious modification holds if $\|f\|_P = +\infty$.)

We first consider two examples to show the trouble one may have finding
search procedures.

Example 2.1: Let X be countable with each single point set in A. Let G
consist of functions taking rational values on X. Let P put positive measure
on each one point set. Then G does not have a P-almost sure search.

Proof: Let f be the indicator function of {x} where P({x}) > 0. Find N
such that $P_f(X_n = x,\ n \ge N) > 0$ and then find $x_1, \ldots, x_{N-1}$ such
that $P_f(\{(x_1, \ldots, x_{N-1}, x, x, \ldots)\}) > 0$. Consider g(z) = 1 if
z = x, g(z) = 2 if z = y, where P({y}) > 0 and
$y \notin \{x_1, \ldots, x_{N-1}\}$, and g(z) = 0 if z is neither x nor y. Then

$$P_g(g(X_n) \to \|g\|_P) < 1.$$

Example 2.2: Let X = [0,1], A = the Borel field of [0,1], G = continuous
functions and P be Lebesgue measure. Then G does not have a P-almost sure
search.

Proof: Let f have a unique maximum at 1. If $f(X_n) \to \|f\|$ almost surely
then there is some interval $I \subset [0, 1/2]$ such that
$P_f(X_n \notin I,\ n = 1, 2, \ldots) > 0$. Consider any continuous function g
which agrees with f outside of I and takes its maximum in I. It is easy to see
that, with positive probability, $g(X_n)$ does not converge to $\|g\|$.

Note that in the two examples $\|f\|_P = \|f\|$ for all $f \in G$. Further,
since each single point set is in A, any deterministic search procedure is
also a random search procedure. Thus, in the two examples G does not have a
deterministic search.

There are at least two ways of getting around this problem: (1)
Consider smaller classes of functions (e.g., unimodal [5]). (2) Use
different criteria for convergence.

We now see that a P-search in probability is possible even when
G = M.

Theorem 2.3: Let G = M. Then for each P, G has a P-search in
probability.

Proof: Let $X_1, X_2, \ldots$ be independent identically-distributed random
mappings each with distribution P. Then for each f

$$P_f(\max_{1 \le i \le n} f(X_i) \to \|f\|_P) = 1. \qquad (1)$$

The proof is completed by using the following lemma.

Lemma 2.4: Suppose that G has a random search such that (1) holds.
Then G has a P-search in probability.

Proof: We sketch the proof. Let $Y_i$ be Poisson random variables
with parameters $\lambda_i \to +\infty$ which are mutually independent of each
other and independent of $X_1, X_2, \ldots$. For each n, let $X_n^*$ = value of
$X_1, \ldots, X_n$ which maximizes $f(X_i)$. Consider the random sequence

$$X_1,\ \underbrace{X_1^*, \ldots, X_1^*}_{Y_1 \text{ times}},\ X_2,\ \underbrace{X_2^*, \ldots, X_2^*}_{Y_2 \text{ times}},\ \ldots$$

We can find "new" $m_k$ which lead to the same distribution as the random
sequence just given. It is easy to verify that this random search works to
give convergence in probability.

The same method of proof yields:

Lemma 2.5: Let each $f \in G$ have P-ess inf f(x) > $-\infty$ and (1) hold;
then G has a random search procedure such that

$$P_f\Big(\frac{1}{n} \sum_{i=1}^{n} f(X_i) \to \|f\|_P\Big) = 1$$

for all $f \in G$.

In reference to Theorem 2.3 it is the opinion of the authors that in
practice, convergence in probability is as useful as convergence almost
surely. In either case one would like some information on the rate of
convergence (a point we return to later).

Let us turn to the other approach of finding searches by restricting
the class G. The following results are all easy.

Lemma 2.6: Let X be an arbitrary topological space. If for every e > 0
there is a finite collection $\{E_i\}_{i=1}^{n}$ of sets whose union is X and
points $S_i$ in $E_i$, i = 1, 2, ..., n, such that

$$\sup_{f \in G}\ \sup_{S \in E_i} |f(S_i) - f(S)| < e$$

then G has a deterministic search.

Proof: Let $e_i = 1/2^i$, i = 1, 2, .... Find $E_1(1), \ldots, E_{n(1)}(1)$, the
sets associated with $e_1$. Let $X_1 = x_1 \in E_1(1), \ldots,
X_{n(1)} = x_{n(1)} \in E_{n(1)}(1)$. Let
$f(X^*) = \max_{1 \le j \le n(1)} f(X_j)$. If $f(X^*) - f(X_j) > 2e_1$, delete
$E_j(1)$ from the space X. Start all over again with the new space and $e_2$,
etc.

Corollary 2.7: Let X be compact, metric and A the Borel sigma-field. Then
any equicontinuous family of functions G has a deterministic search.

Proof: As X is compact, G is uniformly equicontinuous.

Corollary 2.8: Let X be compact and metric and A the associated Borel
sigma-field. Then any compact subset (with respect to the sup-norm metric)
of the continuous functions on X has a deterministic search.

Proof: Ascoli-Arzela theorem and Corollary 2.7.

Corollary 2.9: Let X be compact and metric and G uniformly satisfy a
Lipschitz condition. Then G has a deterministic search.

Proof: By Corollary 2.7.

Corollary 2.10: Let X be a compact, differentiable manifold, A the
generated sigma-field and all f in G have uniformly bounded derivatives.
Then G has a deterministic search.

Proof: By Corollary 2.7 since the family is equicontinuous.

Corollary 2.11: Let X be a compact metric space, A the Borel field of X
and let C(X) be the continuous functions on X with the sup-norm. Let $\mu$
be a probability measure on C(X) with its Borel sigma-field. Then for each
e > 0 a deterministic search may be found such that

$$\mu(f : f(X_n) \to \|f\|) > 1 - e.$$

Proof: First note that we may think of f as a sample path from a
stochastic process with domain X. From the conditions on X it follows that
C(X) is a complete separable metric space ([10, pp. 94, 103]), and hence $\mu$
is a tight measure [11]. Thus, we may find a compact set K such that
$\mu(K) > 1 - e$. Use Corollary 2.8 on K.

Observe that if X = [0,1], the sigma-field of this process is the same
field that is generated by the usual "product-field" construction (a proof of
this statement is in Parthasarathy [12], p. 212). In particular, Corollary
2.11 is related to a study by Kushner [7] which proposes a search for
finding the maximum of Brownian motion sample functions.

The problem of characterizing subsets G that have a deterministic
search is quite interesting, but the authors have not been able to make
much progress. The problem appears to lie in the domain of mathematical
logic.

Note that under the conditions of Corollary 2.11 we may find a
countable dense set of points in X. Let P be a measure putting positive mass
on each point of the set. Then for continuous f, $\|f\| = \|f\|_P$ and Theorem
2.3 and Lemma 2.5 hold for the stochastic process.

3. RATES OF CONVERGENCE 

In applications of sequential search procedures it is highly desirable
that there be some way of assessing what can be done in a finite number of
iterations. For example, one would be interested in knowing, if possible,
how fast $f(X_n) \to \|f\|$. In this section we consider questions of this sort.

To see the difficulties involved we consider an example in random
search.

Example 3.1: As a criterion of the amount of convergence one might
consider $\|f\| - f(X_n)$ or $(\|f\| - f(X_n))/\|f\|$. More generally, we will
use $g(\|f\|, f(X_n))$ where g(x,y) is a function satisfying:

1. For fixed x, g(x,y) is strictly decreasing as y approaches x from
below.

2. g(x,x) = 0.

3. For fixed y, g(x,y) is strictly increasing as x increases (where $x \ge y$)
with a limit $\ge 1$ for all y as $x \to \infty$.

To get a grasp on the rate of convergence one might hope to find a random
search such that for each c > 0 and 0 < d < 1, there is a number N(c,d) such
that for n > N(c,d)

$$P_f(g(\|f\|, f(X_n)) > c) < d.$$

We now show that this cannot be done if X = [0,1] and G is the set of
continuous functions. Let c < 1/2, 0 < d < 1, and n be any fixed integer and
some random search procedure also be fixed. Pick any $f \in G$ and let I be an
interval such that

$$P_f(\{X_1, \ldots, X_n\} \cap I = \emptyset) > d.$$

Let $h \in G$ agree with f on the complement of I and
$g(\|h\|, \|f\|) > 1/2$. Then we have

$$\begin{aligned}
P_h(g(\|h\|, h(X_n)) > c) &\ge P_h(g(\|h\|, h(X_n)) > 1/2) \\
&\ge P_h(h(X_i) = f(X_i),\ i = 1, 2, \ldots, n) \\
&\ge P_h(\{X_1, \ldots, X_n\} \cap I = \emptyset) = P_f(\{X_1, \ldots, X_n\} \cap I = \emptyset) \\
&> d
\end{aligned}$$

ending the example.

A great weakness in the theory of search procedures is the fact that for
G the class of continuous functions on [0,1], under no search procedure can
bounds on the rate of convergence of $f(X_n)$ to $\|f\|$ or of
$(1/n) \sum_{i=1}^{n} f(X_i)$ to $\|f\|$ be established which are uniform on G.
The practical consequence of this weakness is that the experimenter cannot
estimate the level of performance attainable in a finite number of search
iterations. One approach to overcoming these difficulties is to redefine the
search problem by proposing a different (but, hopefully, not unreasonable)
criterion of goodness.

We do this by following some of the ideas implicit in Brooks [9].
Associated with each operating point $x \in X$ and $f \in G$ is the set
$\alpha(x,f) = \{y : f(y) > f(x)\}$, which is here called the domain of
improvement (of f over f(x)). As in Section 2, let a probability measure P be
given on (X, A). As a loss function we propose the P measure of
$\alpha(x,f)$. That is,

$$L(x,f) = P(\alpha(x,f)).$$

Strictly speaking, L should also contain P as a variable, but since P will be
fixed we shall omit this notation.

Thus, L(x,f) is the probability that a person choosing a point Y at
random in X with distribution P will find that f(Y) > f(x). If X is a set of
finite volume in a Euclidean space and P is proportional to volume (i.e., P is
proportional to Lebesgue measure) then L(x,f) is the fraction of the volume on
which f exceeds f(x).

We find that for certain search procedures it is possible to obtain
information on how close $L(X_n, f)$ is to zero. We will say that
$X_1, X_2, \ldots$ are chosen at random if $X_1, X_2, \ldots$ are independent,
identically distributed X-valued random mappings with distribution P. Let
$f \in M$ be fixed and for each n define n* by $1 \le n^* \le n$ and

$$f(X_{n^*}) = \max_{1 \le i \le n} f(X_i).$$

Proposition 3.2: Let $X_1, X_2, \ldots$ be chosen at random and 0 < a < 1;
then for each integer n and $f \in M$,

$$P_f(L(X_{n^*}, f) > a) \le (1-a)^n.$$

Proof: Let $t' = \sup\{t : P(\{x : f(x) > t\}) > a\}$; then
$P(\{x : f(x) \ge t'\}) \ge a$. Thus,
$P_f(L(X_{n^*}, f) > a) = P_f(f(X_i) < t',\ 1 \le i \le n)
= \prod_{i=1}^{n} P(\{x : f(x) < t'\}) \le (1-a)^n$.
By considering any random variable f(X) with a continuous distribution function
we see that equality may hold.

With this criterion of accuracy, the rate of convergence is independent
of the dimensionality of the space X. With other criteria of convergence
this might not be true. This point is discussed further by Spang [2, page
362], who uses two concepts of convergence and appears to doubt Brooks' [9]
comment that the rate of convergence is independent of dimensionality.
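
The bound of Proposition 3.2 translates directly into a sample size: to have $P_f(L(X_{n^*}, f) > a) \le d$ it suffices that $(1-a)^n \le d$. The following minimal Python sketch computes that n and runs the corresponding pure random search. The function names and the `draw` callable (which is assumed to return one sample from P) are illustrative.

```python
import math
import random


def trials_needed(a, d):
    """Smallest n with (1 - a)**n <= d, for 0 < a < 1 and 0 < d < 1."""
    n = max(1, math.ceil(math.log(d) / math.log(1.0 - a)))
    while (1.0 - a) ** n > d:                   # guard against rounding
        n += 1
    while n > 1 and (1.0 - a) ** (n - 1) <= d:
        n -= 1
    return n


def pure_random_search(f, draw, n):
    """Return the best of n independent draws; draw() samples from P."""
    best = draw()
    for _ in range(n - 1):
        x = draw()
        if f(x) > f(best):
            best = x
    return best
```

For example, `trials_needed(0.05, 0.01)` gives 90 regardless of the dimension of X, which is the sense in which the rate does not depend on dimensionality. The guarantee concerns the P-measure of the set of better points, not the distance to a maximizer.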

Further information on the rate of convergence is contained in the next
theorem. In what follows, $M_n$ is defined to be the random variable
$L(X_{n^*}, f)$ defined in Proposition 3.2, and $X_1, X_2, \ldots$ are chosen
at random.

Theorem 3.3: Let f(X) have a distribution function F such that for some
e > 0, $F(x) \ge 1 - e$ implies F is continuous at x. Then $nM_n$ converges
weakly to the exponential distribution with parameter 1.

Proof: Let a > 0; then for large n,

$$\begin{aligned}
P_f(nM_n \le a) &= P_f(M_n \le a/n) \\
&= 1 - P_f(M_n > a/n) = 1 - \prod_{i=1}^{n} P_f(L(X_i, f) > a/n) \\
&= 1 - \prod_{i=1}^{n} P(\{x : f(x) < t_n\}) = 1 - (1 - a/n)^n
\end{aligned}$$

where $t_n = \sup\{t : P(\{x : f(x) > t\}) > a/n\}$. This approaches
$1 - e^{-a}$ as $n \to \infty$, completing the proof. For a > 0, by Taylor's
theorem with remainder on the
logarithm of e X /(l - x/n) n , we see that (large n): 

exp(-a 2 /2n) < e a /(l-P^(nM n >_ a)) < exp(-a 2 /2n + a^/6n 2 ) 
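
The limit in Theorem 3.3 can be checked numerically. The Python sketch below compares the finite-n expression $1 - (1 - a/n)^n$ from the proof with the exponential distribution function $1 - e^{-a}$, and also simulates $nM_n$ in one simple case chosen for illustration: f(x) = x with P uniform on [0,1], so that L(x,f) = 1 - x. The function names, that choice of f and P, and the trial counts are assumptions of the sketch.

```python
import math
import random


def finite_n_cdf(a, n):
    """1 - (1 - a/n)**n: P(n*M_n <= a) for continuous f(X) and a <= n."""
    return 1.0 - (1.0 - a / n) ** n


def exponential_cdf(a):
    """Distribution function of the exponential law with parameter 1."""
    return 1.0 - math.exp(-a)


def simulate_fraction(a, n, trials, seed=0):
    """Fraction of trials with n*M_n <= a when f(x) = x, P uniform on [0,1]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        best = max(rng.random() for _ in range(n))
        loss = 1.0 - best        # L(x, f) = P{y : y > x} = 1 - x
        if n * loss <= a:
            hits += 1
    return hits / trials
```

For fixed a the finite-n value lies above the exponential value and decreases toward it as n grows, so the exponential law is a slightly pessimistic approximation at moderate n.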

In the same vein as Lemmas 2.4 and 2.5, it is shown that we can use
Proposition 3.2 to get searches which converge at a known rate.

Theorem 3.4: One may compute a search procedure $S_1$ under which, for
any positive numbers c and d, a number N(c,d) may be found for which

$$P\Big[\sup_{n > N(c,d)} \frac{1}{n} \sum_{i=1}^{n} L(X_i, f) > c\Big] < d$$

for every $f \in M$.

Proof: Let $\{n(i)\}_{i=1}^{\infty}$ be a sequence of integers such that
n(1) = 1 and i/n(i) converges to 0 monotonically (e.g., $n(i) = 2^{i-1}$). By
Proposition 3.2, we may compute a number N' such that $(1 - c/2)^{N'} < d$,
and then a number N'' such that

$$(c/2)\,[(n(N'') - N'')/n(N'')] + [(n(N') + N'')/n(N'')] < c.$$

Search procedure $S_1$ requires that X be sampled independently with
distribution P at times t = n(j) (j = 1, 2, ...), and for $t \ne n(j)$, $x_t$
is chosen to be the best value in the sequence $\{X_{n(j)}\}$ sampled thus
far: $f(x_t) = \max\{f(X_v) : v \le t,\ v \in \{n(j)\}\}$. Thus evidently
$f(x_t)$, $t \notin \{n(j)\}$, is monotonically increasing in t. Observe
that from the choice of N' and the definition of $S_1$, the best point
$x^*$ sampled through time n(N') satisfies

$$P[L(x^*, f) > c/2] < d.$$

Let Q be the event (with reference to the process determined by $S_1$ on f)
that $L(x^*, f) \le c/2$. If Q occurs, then by the choice of N'' (and the
observation that $L(x,f) \le 1$, always)

$$\sup_{n \ge n(N'')} \frac{1}{n} \sum_{i=1}^{n} L(x_i, f) \le c.$$

In summary,

$$P\Big[\sup_{n \ge n(N'')} \frac{1}{n} \sum_{i=1}^{n} L(x_i, f) > c\Big] \le P[Q^c] < d,$$

and consequently the theorem is proved, with the understanding that n(N'')
suffices for N(c,d).

Theorem 3.5: One may compute a search procedure $S_2$ under which,
for any positive numbers c and d, a number N(c,d) may be found for
which

$$P[L(X_n, f) > c] < d$$

for all n > N(c,d) and all $f \in M$.

Proof: Let {n(j)} be a sparse sequence as in the proof of Theorem
3.4. From this we construct a random sequence {N(j)}, where N(j) has the
sample space {n(j), n(j)+1, n(j)+2, ..., n(j+1)-1} and is chosen by the
randomization which assigns equal probability to each element of this
sample space. $S_2$ is the search procedure which samples X independently
and uniformly at times in {N(j)}. At other times, $x_t$ is chosen to be the best
operating point thus far sampled. The condition imposed on {n(j)} that
j/n(j) converge monotonically to 0 as j tends to infinity ensures us that a
number N' can be found such that

$$P[N(j) = n] < d/2 \quad \text{for all } j > N', \text{ all integers } n.$$

From Theorem 4, a number N'' may be found such that $P[M_{N''} > c] < d/2$. If
k = max{N'+1, N''} then for n > n(k)

$$P[L(X_n,f) > c] \le P[M_{N''} > c] + P[n \in \{N(j)\}] < d,$$

ending the proof.
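The randomized sampling times of $S_2$ can be sketched as follows. The blocks {n(j), ..., n(j+1)-1} are assumed here to be dyadic (n(j) = 2^(j-1)), and all names are hypothetical; the source fixes neither.

```python
import random


def search_s2(f, sample, horizon, seed=0):
    """Sketch of procedure S_2: one uniformly placed sampling time per block.

    Block j is assumed to be {2**(j-1), ..., 2**j - 1}. Returns the
    operating points and the sampling times that were actually used.
    """
    rng = random.Random(seed)
    points, sample_times = [], []
    best_x, best_val = None, None
    block_start = 1
    while len(points) < horizon:
        block_end = 2 * block_start                          # exclusive end
        sample_time = rng.randrange(block_start, block_end)  # N(j), uniform
        for t in range(block_start, block_end):
            if len(points) == horizon:
                break
            if t == sample_time:
                x = sample(rng)
                val = f(x)
                if best_val is None or val > best_val:
                    best_x, best_val = x, val
                points.append(x)
                sample_times.append(t)
            else:
                points.append(best_x)   # best operating point sampled so far
        block_start = block_end
    return points, sample_times
```

Since the blocks grow, the chance that any fixed time n is a sampling time shrinks, which is what lets the bound hold at every large n rather than only on average.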

Without going into detail it is clear that in results 2.6 through 2.10
one may find integers N(e) such that if n > N, $\|f\| - f(X_n) < e$ for all $f \in G$.
(A modification of Corollary 2.11 also holds.) In order to find N one must
know quite a bit about the structure of G. For example, in Corollary 2.8
one must know the compact set. In general, there is no one search which
works for all compact sets. If one knows the compact set, not only may a
convergent deterministic search be found but also a uniform bound on the time
necessary for any degree of convergence.

In closing we note that if G consists of uniformly bounded measurable
functions, then the central results of this section obtain under the loss
function

$$L'(x,f) = \int_{B} f(y)\,P(dy), \quad \text{where } B = \{y : f(y) > f(x)\}.$$

Proposition 3.6 ; Let g be a function I ntt][ having a finite expectation 
with respect to P. Assume for every $f \in G$, $f \le g$, and that $\{X_j\}$ is a random
sequence. Then for every positive c and d, a number N may be computed such
that for every $f \in G$, in the notation of Proposition 3.2,

$$P[L'(X_N^{*}, f) > c] < d.$$

Proof: By the assumption that the integral of g is finite, one can
find a positive number k such that if P[A] < k,

$$\int_{A} g(x)\,P(dx) < c.$$

If N is such that $(1-k)^{N} < d$, from the proof of Proposition 3.2, we know
that with probability greater than 1-d, $L(X_N^{*},f) < k$. By the definition
of L and the choice of k, this means that with probability greater than 1-d,

$$\int_{A} g(x)\,P(dx) < c, \quad A = \{x : f(x) > f(X_N^{*})\}. \qquad (3.1)$$

Finally, as g majorizes f, (3.1) gives us (letting $A = \{x : f(x) > f(X_N^{*})\}$)

$$L'(X_N^{*},f) = \int_{A} f(x)\,P(dx) \le \int_{A} g(x)\,P(dx) < c.$$

From this proposition, the other major results of Section III follow with
L' replacing L, with at most minor modifications of the proofs.
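The proof requires a number N with $(1-k)^{N} < d$. Given values of k and d, such an N can be computed directly; the helper below is a hypothetical illustration, and finding k itself (which depends on g and c) is outside its scope.

```python
import math


def samples_needed(k, d):
    """Smallest N with (1 - k)**N < d, for 0 < k < 1 and 0 < d < 1."""
    if not (0.0 < k < 1.0 and 0.0 < d < 1.0):
        raise ValueError("k and d must lie strictly between 0 and 1")
    # Closed-form starting guess, then adjust to guard against rounding.
    n = max(1, math.floor(math.log(d) / math.log(1.0 - k)))
    while (1.0 - k) ** n >= d:
        n += 1
    while n > 1 and (1.0 - k) ** (n - 1) < d:
        n -= 1
    return n
```

For example, with k = 0.1 and d = 0.05 the bound asks for 29 independent samples.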

4. SEQUENTIAL SEARCH USING NOISY MEASUREMENTS 

In this section we consider the problem of the earlier sections with the
additional complication that errors of measurement are present. To be more
specific, if $X_n$ is the nth operating point, the decision maker observes

$$f(X_n) + Z_n(X_n, f(X_n)) \qquad (4.1)$$

where $Z_n(X_n, f(X_n))$ is a random variable conditionally independent of
$X_1, \ldots, X_{n-1}, Z_1, \ldots, Z_{n-1}$ (given $X_n$ and $f(X_n)$) whose distribution is
conditional on the values of $X_n$ and $f(X_n)$. We will assume that if $X_i = X_j$,
then $Z_i$ and $Z_j$ have the same distribution.

Physically, $f(X_n) + Z_n$ may be regarded as arising from a noisy meter
which measures $f(X_n)$, the noise being dependent upon the operating point $X_n$
and $f(X_n)$, the value at $X_n$. "Noisy measurements" refer to observations of
the form (4.1) (in contrast to $f(X_n)$, which is considered a "noiseless
measurement").

The basic idea in the section is the standard one of replicating
observations to minimize the effect of observational error (see, e.g., Brooks
[9]). We consider several different cases. The first case is when the
measurement error does not depend upon $X_n$ or $f(X_n)$. The distribution $F_Z$
of the error is assumed unknown in the next theorem.

Lemma 4.1: In the noisy measurement case, let $Z_1, Z_2, \ldots$ be unknown
independent identically distributed random variables independent of
$(X_1, f(X_1)), (X_2, f(X_2)), \ldots$ for each $f \in M$. One may compute a search
procedure $S_3$ under which, with P-probability 1, as $n \to \infty$,

$$\frac{1}{n} \sum_{i=1}^{n} L(X_i,f) \to 0$$

(thus $\frac{1}{n} \sum_{i=1}^{n} f(X_i) \to \|f\|_P$ if f is P-bounded below)

for each $f \in M$ such that $P[f(X) = \|f\|_P] = 0$.

Remark: For piecewise continuous functions f, this last restriction is
satisfied if f does not assume its maximum on a plateau.

Proof: The description of the search procedure uses the following
notation: {u(n)} is an observation of a sequence of independent values {U(n)}
P-distributed on X. $F_{N,j}$ denotes the empiric distribution function
constructed from the observations which, during the first N observations of
the search, have been made at u(j), j = 1,2,... . (An empiric distribution
function $F_n$ constructed from any sequence $\{x_i\}_{i=1}^{n}$ of n real numbers is the
cumulative distribution function determined by the expression
$nF_n(x)$ = number of elements $x_i$ of $\{x_i\}_{i=1}^{n}$ such that $x_i < x$.)

$F_{u(j)}$ is the cumulative distribution function (cdf) for the random variables
f(u(j)) + Z; i.e.,

$$F_{u(j)}(z) = F_Z(z + f(u(j))), \quad \text{for every real } z.$$

More generally, $F_x$ is the cdf of f(x) + Z. If H(x) is any real function,
let $\|H\|_* = \sup_{x} |H(x)|$. {K(v)} is a sequence of integers such that if
n > K(v), then for any cdf F, and empiric distribution function $F_n$ constructed
from n independent observations distributed as F,

$$P[\|F - F_n\|_* > 1/v] < 2^{-v}/v.$$

Massey [13] gives an algorithm capable of computing a minimum such number
K(v).

{M(v)} is a sequence computed inductively by the following rule:

M(2) = 1.

M(v) = M(v-1) + A(v) + vK(v), v > 2

where A(v) is some positive integer such that

[M(v-1) + vK(v) + (v+1)K(v+1)]/A(v) < 1/v (4.2)

Having described {K(v)} and {M(v)}, we are in a position to reveal the search
procedure $S_3$.

Step 1:

For each iteration v, v = 2,3,..., of these Steps 1-3 the points
$\{x_n\}_{n=M(v)}^{M(v)+vK(v)}$ are chosen, at each n, from the set of points {u(j): j =
1,2,...,v}, so that each u(j) is sampled K(v) times. Therefore, by time
N = M(v) + vK(v),

$$P[\|F_{N,j} - F_{u(j)}\|_* \le 1/v, \; j = 1,2,\ldots,v] > 1 - 2^{-v}. \qquad (4.3)$$

Step 2:

At time N = M(v) + vK(v), a positive integer $v^* \le v$ is selected such
that for every real number z,

$$F_{N,v^*}(z) \ge F_{N,k}(z) - 2/v \quad \text{for } 1 \le k \le v. \qquad (4.4)$$

If no such v* can be selected, v* is chosen arbitrarily.

Step 3:

At times n, M(v) + vK(v) < n < M(v+1), $X_n = u(v^*)$. At time M(v+1), repeat
the process, with v increased by 1.

Toward outlining a proof that $S_3$, as just described, possesses the property
asserted in the theorem, it is necessary to recognize that with probability 1,
(4.4) will hold for all but finitely many v. For demonstration of this, let v'
be any positive integer not greater than v such that

$$f(u(v')) = \max_{1 \le j \le v} f(u(j)).$$

Then for all z and all $i \le v$,

$$F_{u(v')}(z) = F_Z(z + f(u(v'))) \ge F_{u(i)}(z) = F_Z(z + f(u(i))).$$

The event (which will be denoted by B(v)) that

$$\|F_{N,j} - F_{u(j)}\|_* \le 1/v, \quad 1 \le j \le v \qquad (4.5)$$

implies, by the triangle inequality, that for $j \le v$,

$$F_{N,v'}(z) \ge F_{N,j}(z) - 2/v, \quad \text{all real } z,$$

and thus (4.4) holds with v* = v'. Note that by construction of {K(v)},

$$\sum_{v=2}^{\infty} P(B(v)^{c}) < \sum_{v=2}^{\infty} 2^{-v} < \infty,$$

and consequently, by the Borel-Cantelli lemma, B(v) occurs for all but
finitely many v, concluding our assertion that for all but finitely many v,
v* can be picked to satisfy (4.4). We will hereafter assume without comment
that v* always has the property (4.4). As our only concern is with limit
theorems, this assumption will not lead us astray.
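The selection rule of Step 2 compares empiric distribution functions up to a tolerance. The sketch below is an illustration with hypothetical names. It uses the usual convention that the cdf of an observation f(u) + Z evaluated at z is the fraction of observations not exceeding z, so a point with a larger f has a smaller cdf; under that assumed convention the comparison is written as "no more than tol above every other cdf", which is the reverse of the inequality direction printed in (4.4).

```python
import bisect


def empirical_cdf(observations):
    """Return the right-continuous empirical cdf of a list of reals."""
    xs = sorted(observations)
    n = len(xs)

    def cdf(z):
        return bisect.bisect_right(xs, z) / n

    return cdf


def select_candidates(samples_by_point, tol):
    """Indices j whose empirical cdf never exceeds another's by more than tol.

    samples_by_point[j] holds the replicated noisy observations made at the
    j-th candidate point. Checking the pooled observation values suffices,
    because empirical cdfs are step functions that jump only there.
    """
    cdfs = [empirical_cdf(obs) for obs in samples_by_point]
    grid = sorted({z for obs in samples_by_point for z in obs})
    chosen = []
    for j, cdf_j in enumerate(cdfs):
        if all(cdf_j(z) <= cdf_k(z) + tol for cdf_k in cdfs for z in grid):
            chosen.append(j)
    return chosen
```

With tol playing the role of 2/v, shrinking the tolerance as v grows removes clearly inferior points while, by the Borel-Cantelli argument above, the best sampled point eventually always qualifies.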

The completion of the proof that leads to the convergence of
$\frac{1}{n} \sum_{i=1}^{n} L(x_i,f)$ to 0 is at hand. By the choice of M(v) and A(v), we have that
at all times Q during the vth iteration of Steps 1-3 (v > 2),

[Number of observations $X_i$, $1 \le i \le Q$, taken at u((v-1)*) or u(v*)]/Q > (v-1)/v,

and thus for all n > M(3),

$$\frac{1}{n} \sum_{i=1}^{n} L(x_i,f) < 1/v + ((v-1)/v) \max\{L(u(v^*),f),\, L(u((v-1)^*),f)\}. \qquad (4.6)$$

The proof is completed by showing that almost surely,

$$L(u(v^*),f) \to 0.$$

Let x' be any point in X such that L(x',f) > 0. Then almost surely some
u(h) in an observation of {U(v)} gives f(u(h)) > f(x'). If H is a number
such that

$$6/H \le \|F_{x'} - F_{u(h)}\|_*,$$

then for all v > max{H,h}, if $f(u(j)) \le f(x')$,

$$F_{N,v^*}(z) \ge F_{u(h)}(z) - 2/v > F_{u(j)}(z) + 6/H - 2/v > F_{N,j}(z) + 6/H - 4/v > F_{N,j}(z) + 2/v \quad \text{(all real } z),$$

which implies that j cannot be chosen to satisfy (4.4) for v*. From this
we deduce that

$$\limsup_{v} L(u(v^*),f) \le L(x',f). \qquad (4.7)$$

Let $\{w_n\}$ be a sequence (whose existence is implied by the hypothesis that
$P[f(X) = \|f\|_P] = 0$) such that $L(w_n,f) > 0$ and $L(w_n,f) \to 0$. Then (4.7)
holds almost surely simultaneously for all the $w_n$ (in place of x') and we
conclude that with probability 1,

$$\lim_{v} L(u(v^*),f) \le \inf_{n} L(w_n,f) = 0.$$

Theorem 4.2: Under search $S_3'$ described below, Lemma 4.1 remains
true in the absence of the hypothesis $P[f(X) = \|f\|_P] = 0$.

Proof: $S_3'$ differs from $S_3$ only in Step 2, where for $S_3'$ the restriction
is made that v* be the greatest positive integer not exceeding v such that for
every real number z,

$$F_{N,v^*}(z) \ge F_{N,k}(z) - 2/v, \quad 1 \le k \le v. \qquad (4.8)$$

Observe that $S_3'$ is a version of $S_3$, and consequently it achieves
convergence under the hypothesis of the preceding theorem.

In the absence of a sequence $\{w_n\}$ as described in the proof of the
previous theorem, there is a number t' such that

$$P[f > t'] = 0 \quad \text{and} \quad P[f = t'] > 0. \qquad (4.9)$$

(The abbreviation P[f > b] is used to denote the P-probability of the
domain of improvement {x: f(x) > b}.) We use the notation of the proof of
the preceding theorem. Let h be an integer (surely there is one) such
that f(u(h)) = t'. Then for v > h, under $S_3'$, v becomes v* by virtue of
one of the events A(v) or B(v) (in the sigma-field of the process
determined by $S_3'$ and f) occurring:

A(v): $f(U(v)) = t'$,

B(v) = $B_1(v) \cap B_2(v)$,

where

$B_1(v)$: $t' > f(U(v)) > t' - a(v)$

and

$B_2(v)$: v satisfies (4.4).

Here $a(v) = \inf\{a : \|F_Z - F_{Z+a}\|_* \le 2/v\}$, $\|\cdot\|_*$ being the sup norm.

Note that $P[A(v) \cup B(v)] \ge P[A(v)] = P[f = t']$, which is positive and
independent of v. Thus under $S_3'$, during evolution of the process
infinitely many different v are chosen as v*. Our proof consists of showing
(below) that

$$\lim_{v} P[B(v) \mid A(v) \cup B(v)] = 0. \qquad (4.10)$$

Note that A(v) and $B_1(v)$ are independent of $\{U(k) : k \ne v\}$. Thus (4.10)
implies that $\lim_{v} P[f(U(v^*)) = t'] = 1$, which in turn implies that
$\{L(U(v^*),f)\}$ converges in probability to 0. This (in view of equation
(4.6)) concludes the proof.

We proceed now to the demonstration of (4.10).

$$P[B(v) \mid A(v) \cup B(v)] \le P[B_1(v) \mid A(v) \cup B(v)] = P[t' > f(U(v)) > t' - a(v)] \,/\, P[t' \ge f(U(v)) > t' - a(v)].$$

As {a(v)} converges to 0 monotonically, by the continuity property of
measures,

$$\lim_{v} P[t' > f(U(v)) > t' - a(v)] = 0.$$

Similarly,

$$\lim_{v} P[t' \ge f(U(v)) > t' - a(v)] = P[f(U(v)) = t'] > 0.$$

Thus $P[B_1(v) \mid A(v) \cup B(v)] \to 0$, which in turn implies that
$P[B(v) \mid A(v) \cup B(v)] \to 0$.
B 2 (v): 
Corollary 4.3: Under the conditions of the theorem, one may compute
a search procedure such that the results of the theorem still obtain,
and further

$$L(X_j,f) \to 0$$

(and consequently $f(X_j) \to \|f\|_P$) in P-probability for all $f \in M$.

We describe the modifications of $S_3'$ which achieve the result, leaving
verification to the reader. For v = 1,2,..., choose the K(v) sampling times
randomly (i.e. uniformly) from M(v), M(v)+1, ..., M(v+1) - 1. At the
remaining times between M(v) and M(v+1), let X(t) = U((v-1)*). Observe that
the sample times become sparse.

Proposition 4.4: Given positive numbers c, d, and e, and $F_Z$, the
common distribution of the independent noise samples, there is a number
N(c,d,e) such that for n > N(c,d,e), under the search described below,
for all $f \in M$,

$$P_f\big[P[f > f(X_n) + e] > c\big] < d.$$

Proof: {U(j)} is a sequence of $N_1$ independent, X-valued, P-distributed
observations, where $N_1$ is a number large enough to assure (in accordance
with Proposition 3.2) that

$$P[\min_{j \le N_1} L(U(j),f) > c] < d/2.$$

Let h be a mapping with domain [-1,1] such that $h(a) = F_{Z+a}$. Then h
is 1 to 1 and continuous with respect to the Prohorov metric (Prohorov [14])
on the space of distribution functions. Consequently $h^{-1}$ is uniformly
continuous. $\delta$ is defined to be a modulus of continuity associated with e
(assumed less than 1). $N_2$ is a number such that, in the notation of the
Glivenko-Cantelli Theorem, for $n > N_2$,

$$P[\|F - F_n\|_* > \delta/2] < d/2N_1.$$

$N = N_1 N_2$ and our search consists of sampling at each point U(j) and then
letting N* be the number j such that for some x and all k

$$F_{j,N}(x) \ge F_{k,N}(x) - \delta/2. \qquad (4.11)$$

At times greater than N, the operating point is chosen to be U(N*). Toward
showing that the strategy has the property given in the theorem, let U'
denote the observation U(k) which minimizes L(U(j),f). With probability
greater than 1 - d, simultaneously

(i) L(U',f) < d

and, for $1 \le j \le N_1$,

(ii) $\|F_{j,N} - F_{Z+f(U(j))}\|_* < \delta/2$.

Assuming (i) and (ii) hold, by the triangle inequality and rudimentary
properties of the translation parameter family $F_{Z+a}$, we see that (4.11)
implies a bound, in terms of $\delta$, on

$$\|F_{Z+f(U(N^*))} - F_{Z+f(U')}\|_* = \|F_Z - F_{Z+(f(U') - f(U(N^*)))}\|_*$$

which, in view of the fact that the sup norm majorizes the Prohorov metric
and also the way $\delta$ is defined, implies f(U') < f(U(N*)) + e. This completes
the proof.

We offer below some further refinements in the noisy measurement case.
The proofs are only sketched as the ideas are similar to proofs already
used.

Let $\theta_n = f(X_n) + Z_n$, the nth observed value. By $\hat{f}(n)$ we will denote
any estimate of $\|f\|_P$. That is, $\hat{f}(n)$ is a measurable function of
$(X_1, \ldots, X_n, \theta_1, \ldots, \theta_n)$. The basic idea in the next two results is the
standard one of replicating observations to minimize the effects of
observational error (see Brooks [9]).

Theorem 4.5: Let the $Z_n$ be i.i.d. r.v.'s with a known distribution
function $F_Z$. One may compute a search procedure S and estimates $\hat{f}(n)$ such
that

1) $\hat{f}(n) \to \|f\|_P$ a.s. ($P_f$) for all $f \in M$.

2) $L(X_n,f) \to 0$ i.p. ($P_f$) for all $f \in M$.

3) $\sum_{i=1}^{n} L(X_i,f)/n \to 0$ a.s. ($P_f$) for all $f \in M$.

Sketch of the proof: Pick z a unique p th percentile for $F_Z$, that
is $F_Z(z) \ge p$ and $F_Z(z-) \le p$ (for some fixed p, 0 < p < 1). At any particular
x, by using the Kolmogorov-Smirnov approach and observing the p th percentile
of $f(x) + Z_n$, n = 1, ..., N(e), we may estimate f(x) by the p th percentile $\hat{f}$
in such a way that

$$P_f(|\hat{f} - f(x)| < e) > 1 - e.$$

Let $Y_1, Y_2, \ldots$ be i.i.d. X-valued r.v.'s with distribution P. We
proceed in iterations as in earlier theorems. During the nth iteration
we have estimates $\hat{f}_1, \ldots, \hat{f}_n$ of $f(Y_1), \ldots, f(Y_n)$, respectively, all
within $e_n$ with probability > $1 - e_n$. During the (n+1)st iteration most
observations are at a point among $\{Y_1, \ldots, Y_n\}$ picked at random from
among those $Y_j$ satisfying $\hat{f}_j > \max\{\hat{f}_1, \ldots, \hat{f}_n\} - e_n$. The other
observations give estimates of $f(Y_1), \ldots, f(Y_n), f(Y_{n+1})$ to within
$e_{n+1}$ with probability > $1 - e_{n+1}$. During any given iteration, $\hat{f}(m)$,
the estimate of $\|f\|_P$, is the maximum of the $\hat{f}_j$ of the previous iteration.

By choosing the $e_n$ such that the Borel-Cantelli lemma holds and by
choosing (during each iteration) a smaller and smaller fraction of $X_j$'s
to be used in estimating the $f(X_j)$'s, one can show that the results of
the theorem hold.
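The percentile idea in the sketch can be illustrated as follows. If z is the p th percentile of the known noise distribution, the p th percentile of f(x) + Z is f(x) + z, so subtracting z from the sample percentile of replicated observations gives an estimate of f(x). This is a sketch under assumptions: the noise is taken to be standard normal purely for the demonstration, and the function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def percentile_estimate(observations, p, noise_quantile):
    """Estimate f(x) from replicated observations of f(x) + Z.

    noise_quantile is the p th percentile of the known noise distribution.
    """
    xs = sorted(observations)
    # Index of the sample p th percentile (smallest order statistic whose
    # empirical cdf value is at least p).
    idx = min(len(xs) - 1, max(0, math.ceil(p * len(xs)) - 1))
    return xs[idx] - noise_quantile


def demo(true_value=3.0, p=0.75, n=4001, seed=1):
    """Replicate noisy observations at one point, assuming N(0, 1) noise."""
    rng = random.Random(seed)
    observations = [true_value + rng.gauss(0.0, 1.0) for _ in range(n)]
    return percentile_estimate(observations, p, NormalDist().inv_cdf(p))
```

The number of replications needed for a given accuracy e is what the Kolmogorov-Smirnov bound supplies in the text; the demonstration simply uses a large fixed n.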

Theorem 4.6: For all $x \in X$ suppose that the noise Z(x,f(x)) satisfies
one of the following:

a.) Z(x,f(x)) has a $N(\mu, \sigma^2(x,f(x)))$ distribution, $\mu$ known,
$\sigma^2(x,f(x))$ unknown.

b.) Z(x,f(x)) has a distribution which is symmetric about the known
unique median $\mu$.

c.) Z(x,f(x)) has a known mean $\mu$ and variance bounded uniformly above.

Then one may find a search procedure and estimates $\hat{f}(n)$ such that
1), 2) and 3) of Theorem 4.5 hold.

Sketch of proof: The thing to note is that in a), b) and c), for
a fixed point $x \in X$, if we repeatedly sample $f(x) + W_j$, $W_j$ independent
and identically distributed with distribution the same as Z(x,f(x)), then
for each e > 0 we can find a stopping rule T and T-measurable estimates
$\hat{f}(x)$ of f(x) such that

$$P_f(|\hat{f}(x) - f(x)| < e) > 1 - e.$$

In a) one could use the t-variable, in b) use the ideas expressed in
Kendall and Stuart [15], pages 513-522, and use Chebyshev's inequality in
case c).

We thus can apply the same ideas as in Theorem 4.5.
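For case c) the Chebyshev argument can be made concrete. If the noise has known mean $\mu$ and variance at most $\sigma^2_{\max}$, the average of n replicated observations minus $\mu$ misses f(x) by e or more with probability at most $\sigma^2_{\max}/(n e^2)$, so any n making this smaller than e suffices. The helper names below are hypothetical and the fixed sample size is a simplification of the stopping rule T mentioned in the text.

```python
def chebyshev_sample_size(var_bound, eps):
    """A sample size n with var_bound / (n * eps**2) < eps."""
    if var_bound <= 0 or eps <= 0:
        raise ValueError("var_bound and eps must be positive")
    n = max(1, int(var_bound / eps ** 3))
    while var_bound / (n * eps ** 2) >= eps:
        n += 1
    return n


def mean_estimate(observations, noise_mean):
    """Estimate f(x) as the sample mean of f(x) + W_j minus the known mean."""
    return sum(observations) / len(observations) - noise_mean
```

With a variance bound of 1 and e = 0.5, nine replications already satisfy the Chebyshev requirement; the required number grows like $1/e^3$ as e shrinks.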


If $\mu$ were unknown it would still be possible to satisfy conditions
2) and 3), but not condition 1).

We close this section by mentioning related studies. We have stated
that in Kushner's theory [7] it is supposed that f is a sample function of
a known Brownian motion process. It is further allowed that the measurement
may be corrupted by Gaussian noise having zero mean and a known
variance, which is allowed to depend on the operating point x. The framework
for computing an optimal search procedure minimizing $E[(\|f\| - f(x_n))^2]$
is sketched, but it is not proven that these methods yield convergence of
the above expectation to 0.

Our studies are also somewhat related to the subject of "stochastic
approximation," initiated by Monro and Robbins [16] and placed in an
optimization setting by Kiefer and Wolfowitz [17]. A definitive survey
of stochastic approximation has been written by Schmetterer [18]. Briefly,
the stochastic approximation problem in determining the maximum of a
regression function may be viewed as the problem of finding a search
procedure yielding a sequence $\{X_j\}$ converging (either in probability
or almost surely) to x*, where x* is the unique operating point maximizing
f. The stochastic approximation setting is more general than ours in that
the noise process, while (as in our studies) being independent of earlier
observations, may be unknown and yet depend on x. But it is at the same
time more restrictive than our theory because f must be a function which
is unimodal. There are various other assumptions imposed on both f and
the noise process; the reader is invited to consult the stochastic
approximation literature.

5. COMMENTS

The goal in this paper has been to delimit what can be done by sequential
search procedures when the set of objective functions is rich enough to
include all continuous functions. This goal is more in the tradition of
automata theory than numerical analysis. Where possible, we have sought
bounds to the number of observations needed to accomplish those results that
can be accomplished. Toward this goal we have revealed several search
procedures giving convergence (in various senses) to optimal performance.

Many of these results, especially in the noisy measurement case, are
believed to be new.

For particular numerical problems wherein some prior knowledge of the
criterion function f is available, we expect that often heuristic
considerations will yield more rapid convergence than our algorithms. The literature
suggests that heuristic "creeping search" programs (e.g. Schumer and
Steiglitz [19]) have been used for some time. In any event, in computation,
once the designer has found the number of searches, N, required to satisfy
his tolerance of error, if the criterion function possesses any regularity
whatsoever, it would seem sensible to sample at evenly spaced grid points
rather than randomly chosen points as per the preceding algorithms. We
suspect that the procedures we have proposed may have merit if the function
f is easily evaluated (such as in linear or quadratic programming problems,
etc.). Regardless of its computational merits (or lack thereof), the
preceding analysis should have practical value in pointing out that certain
search problems which are much more difficult than those currently studied
are, in principle at least, amenable to solution.

Our viewpoint and procedures differ from other approaches to the
sequential search problem in that the nature of the domain space can be
suppressed. As noted above, the dimension of X plays little role, and in
contrast with any other studies, the closeness of the operating point x to
an optimizing point x* is of no consequence; it is on the closeness of f(x)
to f(x*) that our attention focuses.